This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Federated learning (FL) [1, 2] has risen as a groundbreaking subdomain of machine learning (ML) that enables Internet of Things (IoT) devices to contribute their real-time data and processing to train ML models. FL represents a distributed architecture of a central server and heterogeneous clients, aiming to reduce the empirical loss of model prediction over nonindependent and identically distributed (non-IID) data. In contrast to traditional ML algorithms that require large amounts of homogeneous data in a central location, FL utilizes on-device intelligence over distributed data [3, 4]. The limited feasibility of ML in industrial and IoT applications is overturned by the introduction of FL. Some potential applications of FL include Google keyboard [5], image-based geolocation [6], healthcare informatics [7], and wireless communications [8].

Each round of an FL paradigm consists of client-server communication, local training, and model aggregation [9]. The communication overhead is usually due to model broadcast from the server to all clients and vice versa. In every communication round, there is a feasibility risk in terms of limited network bandwidth, packet transmission loss, and privacy breach. In the growing applications of the Industrial Internet of Things (IIoT), where the communication is Machine to Machine (M2M), these parameters may be static, making efficiency in data transfer important. Modified communication algorithms [10] use compression and encryption to reduce the model size and protect privacy. Communication load is also determined by the number of edge devices. Sparsification of communication [11] implemented over clients is designed to increase the convergence rate and reduce network traffic on the server. Many models also utilize hierarchical clustering [12] to generalize similar client models and reduce the aggregation complexity.

Apart from communication, training ML models in a heterogeneous setup presents a huge challenge [13]. Once the server model is broadcast, the clients train on it using hyperparameters such as client ratio (i.e., the fraction of clients chosen out of the total, for example out of 100), learning rate, batch size, and epochs per round. Computational power and the properties of the data (ambiguity, size, and complexity) vary drastically across edge devices, and diversely trained client models are hard to aggregate. In a realistic scenario of thousands of edge devices, the updated global model may not converge at all. Existing aggregation algorithms such as FedAvg and FedMA [14] focus more on the integration of the weights of the local models. Convergence rate and learning saturation are common concerns when it comes to training and aggregation. Several novel approaches work around model aggregation either by using feature fusion of global and local models [15] or by grouping similar client models [16] to increase generalization. Some works also use multiple global models to improve convergence [17].

Research on making FL models adaptive to non-IID data has focused primarily on model aggregation. Local training of the model itself is an underexplored step, despite its role in the final accuracy. In this paper, we propose three novel contributions to lessen the empirical risk in FL, as shown in Figure 1:

(i) Clustering of clients solely based on model hyperparameters to increase the learning efficiency per unit training of the model

(ii) Implementation of density-based clustering, i.e., DBSCAN, on the hyperparameters for proper analysis of device properties

(iii) Introduction of genetic evolution of hyperparameters per cluster for finer tuning of individual device models and better aggregation

[figure omitted; refer to PDF]

In particular, we introduce a new algorithm, namely, Genetic CFL, that clusters hyperparameters of a model to drastically increase the adaptability of FL in realistic environments. Hyperparameters such as batch size and learning rate are core features of any ML model. In practice, every model is tuned manually depending on its behavior on the data. Therefore, in a realistic heterogeneous setup, the proper selection of these parameters could yield significantly better results. The DBSCAN algorithm is used since it does not require a predetermined, static number of clusters and uses the neighbourhood of model hyperparameters for clustering. We also introduce genetic optimization of those parameters for each cluster. The genetic algorithm is used since it is highly flexible across applications and scalable to higher dimensions. As defined, each cluster of clients has its own unique set of properties (i.e., hyperparameters) that are suitable for the training of the respective models. In each round, we determine the best parameters for each cluster and evolve them to better suit the cluster.

The rest of this paper is organized as follows. Section 2 discusses the recent work done in the fields of FL, clustering, and evolutionary optimization algorithms. The proposed algorithm is defined in Section 3, followed by the results in Section 4. Finally, the paper is concluded in Section 5.

2. Related Work

In this section, we survey the current literature on the topics of FL, density-based clustering, and evolutionary algorithms, respectively, and try to understand their limitations.

2.1. Federated Learning

Recently, FL as a distributed and edge ML architecture has been studied extensively [1, 18]. The decentralized nature of FL contrasts with traditional ML algorithms, which are difficult to train in a heterogeneous environment consisting of non-IID data. Novel approaches have tried to overcome this difficulty through various model aggregation algorithms, namely, FedMA [14], feature fusion of global and local models [15], and agnostic FL and grouping of similar client models [16] for better personalization and accuracy. Clustering takes advantage of data similarity across clients and models [19], enables efficient communication, and improves global generalization [20]. In general, much work is yet to be done in terms of efficient model training on non-IID data.

2.2. Density-Based Clustering

Clustering in FL is primarily used for efficient communication and better generalization. In a realistic scenario with thousands of nodes, aggregating everything into a single model may slow convergence greatly. Several partitioning, hierarchical, and density-based clustering algorithms have been applied to some of the problems existing in FL. Partitioning clustering algorithms such as k-means clustering [21] demand a predetermined number of clusters, which is not feasible in practice. Examples of algorithms that do not require a predefined number of clusters include agglomerative hierarchical clustering [22] and generative adversarial network-based clustering. In this paper, we propose to use DBSCAN (density-based spatial clustering of applications with noise) [23], a density-based clustering algorithm that only groups points if they satisfy a density condition.

2.3. Evolutionary Algorithms

The hyperparameters of a model determine its ability to learn from a certain set of data. Optimization of ML models and their hyperparameters using evolutionary algorithms [24] such as whale optimization [25] and genetic algorithms [26] has been explored by many researchers. In addition, these algorithms have been extensively used with DL frameworks, where they have become a trend for optimization tasks [27]. They have not yet been adopted extensively for FL. Also, algorithms such as reinforcement learning (RL) with a focus on Q-Learning are not suitable for highly complex scenarios [28]. The need for hyperparameter tuning increases even more in FL due to the ambiguity in data, and the abovementioned optimization algorithms assist in tuning those parameters beyond manual capacity. Since optimizing the parameters of each client model individually is not feasible, we propose to do so for each cluster.

Through the survey, we observe that FL is greatly limited by efficiency of individual client training that includes apt choice of hyperparameters, increasing adaptive nature of the models and optimization of such process.

3. Genetic CFL Architecture

In this section, we give a detailed mathematical model of our algorithm, genetic CFL. The complete pipeline is divided into two parts, the initial broadcast round represented by Algorithm 1 to determine the clusters and the federated training using genetic optimization represented by Algorithm 2. The variational behavior of the algorithm with different hyperparameters, including client ratio ( $n$ ), number of rounds, $ϵ$ , minimum samples, learning rate ( $η$ ), and batch size is explained in this section. Table 1 elucidates all the symbols utilized in the algorithm.

Table 1

Symbol representations.

Symbol	Meaning
n	Number of clients
$η$	Learning rate
B	Batch size
$C_{i}$	$i^{th}$ client
$w^{0}$	Model weights
$w_{n}^{0}$	Model weight of $n^{th}$ client

Algorithm 1: Initial broadcast round and clustering.

$n$ : number of clients

$η$ : learning rate

$η_{m}$ : learning rate list

(1) procedure Server

(2) $w^{0}$ : server model initialization

(3) Initialize $η_{m}$ with learning rates from $10^{-1}$ to $10^{-6}$

(4) Broadcast($w^{0}$, random($η_{m}$, 3))

(5) procedure Client

(6) $i \leftarrow 0$

(7) while $i \neq n$ do

(8) $j \leftarrow 0$

(9) while $j \neq 3$ do

(10) Train $w_{j}^{0}$ on $η_{j}$

(11) $losses \leftarrow loss(w_{j}^{0})$

(12) $j \leftarrow j + 1$

(13) $i \leftarrow i + 1$

(14) return $w_{\min}^{0}, η_{\min}, {losses}_{\min}$

(15) procedure Server

(16) $w^{0} \leftarrow \sum w_{n}^{0} / n$

(17) Initialize DBSCAN clustering algorithm

(18) $ϵ \leftarrow 10^{-6}$

(19) ${points}_{\min} \leftarrow 2$

(20) $model \leftarrow DBSCAN(ϵ, {points}_{\min})$

(21) $clusters \leftarrow model.fit\_predict(η_{n})$

Algorithm 2: Genetic optimization based FL on clustered data.

rounds: number of loops for training the federated model

(1) function Mutate($η$)

(2) $factor \leftarrow random(-1, 0, 1)$

(3) $η \leftarrow η + η \times factor / 10$

(4) return $η$

(5) function Crossover($η_{n}$)

(6) Initialize temporary array $η_{temp}$ to store $η$

(7) $η_{temp}[0, 1] \leftarrow η_{n}[0, 1]$

(8) for $k \leftarrow 2$ to $size(η_{n})$ do

(9) ${parent}_{A}, {parent}_{B} \leftarrow random(0, size(η_{n}))$

(10) $η_{temp}[k] \leftarrow Mutate((η_{n}[{parent}_{A}] + η_{n}[{parent}_{B}]) / 2)$

(11) return $η_{temp}$

(12) function Evolve(${losses}_{n}, η_{n}$)

(13) ${losses}_{n}, order \leftarrow sort({losses}_{n})$

(14) $η_{n} \leftarrow sort(η_{n}, order)$

(15) return Crossover($η_{n}$)

(16) procedure Train

(17) $len \leftarrow size(cluster)$

(18) $ind \leftarrow 0$ to $len$

(19) Initialize $η_{global}$ with shape $(len, size(clusters[ind]))$

(20) ${clusters}_{unique} \leftarrow unique(cluster)$

(21) for $i \leftarrow 0$ to rounds do

(22) for $k \leftarrow 0$ to $size(cluster)$ do

(23) $ind \leftarrow [clusters.index(cluster[i])]$

(24) $η_{global}[k] \leftarrow Evolve(losses[ind], η_{n}[ind])$

(25) Empty arrays $losses$, $η_{n}$

(26) for $k \leftarrow 0$ to $n$ do

(27) $w_{k}^{0}, losses[k], η_{n}[k] \leftarrow train(w^{0}, η_{global}[clusters[k]][clusters[k].nextIndex])$

(28) $w^{0} \leftarrow \sum w_{m}^{0} / n$

The purpose of Algorithm 1 is to discreetly determine the data characteristics of an edge device without intruding on its privacy. A server model ($w^{0}$) is initialized and broadcast to $n$ clients, $C \subseteq \{C_{0}, C_{1}, \dots, C_{tot}\}$. With each distributed model, three different values of $η$ are broadcast. A sample size of three is chosen to introduce variance in training, although a larger number of samples can also be used in experiments. These learning rates are chosen from an array ($η_{m}$) ranging from $10^{-1}$ to $10^{-5}$. Each edge device receives $w^{0}$, which is cloned for each value of $η$, and every clone is trained individually for a single epoch. Data properties unique to an edge device, such as size, complexity, ambiguity, and variance, drastically affect the training, and thus the hyperparameters of a model ($η$ and batch size) are chosen accordingly. From the three trained models in an edge device, the one with the least loss, denoted as $w_{\min}^{0}$, is chosen. Each edge device then returns $w_{\min}^{0}$, $η_{\min}$, and ${losses}_{\min}$. These values are significant because they represent the data characteristics of the respective edge devices.
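The client-side probe described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-parameter linear model, the `train_one_epoch` helper, the function name `probe_client`, and the toy data are all hypothetical. Only the idea of training one clone per candidate learning rate for a single epoch and returning the lowest-loss result is taken from the text.

```python
import random

def train_one_epoch(w, data, lr):
    """Hypothetical stand-in for local training: one SGD pass fitting y = w * x."""
    for x, y in data:
        grad = 2.0 * (w * x - y) * x          # derivative of the squared error w.r.t. w
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return w, loss

def probe_client(w0, data, candidate_lrs):
    """Train one clone of w0 per learning rate and return the lowest-loss triple."""
    results = []
    for lr in candidate_lrs:
        w, loss = train_one_epoch(w0, data, lr)   # w0 is a float, so each call starts fresh
        results.append((w, lr, loss))
    return min(results, key=lambda r: r[2])       # (w_min, eta_min, loss_min)

rng = random.Random(0)
eta_m = [10.0 ** -k for k in range(1, 6)]         # candidate rates 1e-1 ... 1e-5
sampled = rng.sample(eta_m, 3)                    # three rates broadcast to this client
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]  # toy local data, y = 3x
w_min, eta_min, loss_min = probe_client(0.0, data, sampled)
```

Only the triple returned by `probe_client` would leave the device in this sketch; the local data stays on the client.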

At the server, the models $w_{0}^{0}, w_{1}^{0}, \dots, w_{n}^{0}$, the learning rates $η_{0}, η_{1}, \dots, η_{n}$, and their respective losses are obtained. Model aggregation is then used to obtain the server model by combining the edge device models. The weights of the models ($w_{n}^{0}$) are summed iteratively as follows:

$$\sum_{i = 0}^{n} w_{i}^{0} = w_{0}^{0} + w_{1}^{0} + w_{2}^{0} + \dots + w_{n}^{0}. \tag{1}$$

After summation, the output of the equation is divided by the number of clients to obtain the aggregated model as

$$w^{0} \leftarrow \frac{\sum w_{n}^{0}}{n}. \tag{2}$$
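The aggregation in equations (1) and (2) amounts to an element-wise mean of the client weights. The sketch below assumes, purely for illustration, that each client model is a flat list of floats; real models hold per-layer tensors, and the function name `aggregate` is hypothetical.

```python
def aggregate(client_weights):
    """Element-wise mean of client weight vectors, following equations (1) and (2)."""
    n = len(client_weights)
    if n == 0:
        raise ValueError("need at least one client model")
    summed = [0.0] * len(client_weights[0])
    for weights in client_weights:            # equation (1): running sum over clients
        for idx, value in enumerate(weights):
            summed[idx] += value
    return [total / n for total in summed]    # equation (2): divide by the client count

clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
server = aggregate(clients)                   # [3.0, 4.0]
```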

After server model aggregation, the DBSCAN clustering algorithm is applied. In a realistic scenario, the number of edge devices and their variance cannot always be determined. In partitioning clustering methods such as K-means clustering, the number of clusters has to be predetermined and is not dynamic. DBSCAN, on the contrary, uses density-based reasoning for the grouping of similar objects. It takes two mandatory inputs, $ϵ$ and min samples. Any point $x$ forms a cluster if a minimum number of samples lie in its $ϵ$-neighbourhood. The neighbourhood and the hyperparameter space are defined as

$$A_{ϵ} \equiv \{ x \in {HP}_{space} \mid dist(x, y) < ϵ \}, \tag{3}$$

$${HP}_{space}(η) \in [10^{-7}, 10^{-3}], \tag{4}$$

$${HP}_{space}(\text{batch size}) \in [16, 128]. \tag{5}$$

Here, ${HP}_{space}$ represents the domain in which the point $x$ must lie. In our case, it is the range of hyperparameters, specifically the learning rate $η$. Each $ϵ$-neighbourhood must contain a certain number of points (MinPts) to be called a cluster as follows:

$$|A_{ϵ}| > MinPts, \quad MinPts \in ℕ. \tag{6}$$

In the object space of only the learning rate, $|η_{2} - η_{1}|$ gives the Euclidean distance used for the $ϵ$-neighbourhood. When the number of dimensions is increased with the addition of batch size ($B$), the Euclidean distance formula for two coordinates is used, and logarithmic values of the hyperparameters are taken to map the exponentially spaced values to a linear scale. The calculation is

$$ϵ = \sqrt{\left(\log_{10} \frac{η_{2}}{η_{1}}\right)^{2} + \left(\log_{2} \frac{B_{2}}{B_{1}}\right)^{2}}. \tag{7}$$
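As a worked instance of equation (7), the following sketch computes the log-scaled distance between two (learning rate, batch size) points. The function name `hp_distance` and the sample values are assumptions chosen for the example.

```python
import math

def hp_distance(eta_1, batch_1, eta_2, batch_2):
    """Euclidean distance between two (learning rate, batch size) points in log space."""
    d_eta = math.log10(eta_2 / eta_1)       # learning rates compared in decades
    d_batch = math.log2(batch_2 / batch_1)  # batch sizes compared in doublings
    return math.sqrt(d_eta ** 2 + d_batch ** 2)

# One decade apart in learning rate and one doubling apart in batch size
d = hp_distance(1e-3, 16, 1e-4, 32)         # sqrt(1 + 1), about 1.414
```

Because both axes are logarithmic, the distance depends only on the ratios of the hyperparameters, so a pair at $(10^{-5}, 64)$ and $(10^{-6}, 128)$ is exactly as far apart as the pair above.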

After each edge device is allotted a cluster-ID, we implement phase-2, shown by Algorithm 2. This section of the algorithm works under the main control loop which runs for $i$ rounds. In every $i$ th iteration,

(1) Hyperparameters are optimized per cluster using genetic algorithm involving evolution followed by crossover and finally mutation

(2) The server model with optimized hyperparameters is broadcast to each client clusterwise

(3) Each client is trained based on said parameters

(4) Client models are aggregated to form the latest server model

Every cluster has a different set of characteristic hyperparameters suitable to the edge devices belonging to it. These clustered parameters are evolved genetically, followed by training, in every $i$th round. Using genetic optimization for tuning moves the set of hyperparameters toward an optimal set each round. An array $η_{global}$ is initialized to store the learning rates for each cluster, and its contents are modified every round. It is of shape $(C, size(C_{i}))$, where $C$ is the number of clusters, $C_{i}$ represents the $i$th cluster, and $size(C_{i})$ represents the number of edge devices in the $i$th cluster. The hyperparameters of a cluster having shape $m_{0}$ are sorted through their losses:

$$\begin{aligned} {losses}_{n}, order &\leftarrow sort({losses}_{n}), \\ η_{n} &\leftarrow sort(η_{n}, order). \end{aligned} \tag{8}$$

Once sorted, we obtain new individuals through crossover and mutation, respectively. The best individuals (hyperparameters in a cluster) retain their genes and are promoted to the next generation (round), while the others are formed by mating of individuals from the last generation as

$$η_{new} = \{ η_{0}, η_{1}, \dots, η_{size(C_{m})} \}. \tag{9}$$

The new learning rates $η_{new}$ are chosen either directly or by mating. The number of $η$ taken from the old generation can vary. From (9), we derive the modified parameters:

$$η_{new} = \left\{ η_{old}[0], η_{old}[1], \dots, \frac{η_{old}[P_{A}] + η_{old}[P_{B}]}{2} \left(1 + \frac{f}{10}\right) \right\}, \tag{10}$$

where $P_{A}, P_{B} \in [0, 9]$ and $f \in [-1, 1]$.
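The sort, crossover, and mutation steps of equations (8) to (10) can be sketched for a single cluster as below. This is an illustrative reading of Algorithm 2 rather than the authors' code: the seeded `random.Random` generator, the `n_elite` parameter (fixed at 2, matching the two rates carried over unchanged), and the sample losses and rates are assumptions.

```python
import random

def mutate(eta, rng):
    """Scale eta by 1 + f/10 with f drawn from {-1, 0, 1}."""
    factor = rng.choice([-1, 0, 1])
    return eta + eta * factor / 10

def crossover(etas_sorted, rng, n_elite=2):
    """Keep the n_elite best rates; fill the rest with mutated parent averages."""
    new = list(etas_sorted[:n_elite])
    for _ in range(n_elite, len(etas_sorted)):
        parent_a = rng.choice(etas_sorted)    # parents drawn from the whole old generation
        parent_b = rng.choice(etas_sorted)
        new.append(mutate((parent_a + parent_b) / 2, rng))
    return new

def evolve(losses, etas, rng):
    """Sort rates by ascending loss, then apply crossover and mutation."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return crossover([etas[i] for i in order], rng)

rng = random.Random(42)
losses = [0.9, 0.2, 0.5, 0.7]                 # per-client losses within one cluster
etas = [1e-2, 1e-3, 5e-3, 2e-3]               # learning rates used by those clients
next_etas = evolve(losses, etas, rng)         # first two entries are 1e-3 and 5e-3
```

Since each child is a parent average scaled by at most 10%, the new generation stays close to the range of rates that already worked in the cluster, which keeps the search local.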

After genetic evolution, the server model is again broadcast to all devices with their respective cluster hyperparameters. Each edge device trains for 1 epoch, and the complete process of genetic optimization, training, and model aggregation is repeated for $i - 1$ rounds.

4. Experiments and Results

This section deals with the experiments that have been conducted to validate and test the proposed genetic CFL architecture. Section 4.1 deals with the clustering of the client edge devices and the clustering behavior under various parameters. Sections 4.2 and 4.3 are concerned with the performance of the genetic CFL architecture, after DBSCAN clustering, on the MNIST and CIFAR-10 datasets, respectively, and its comparison with the generic FL architecture. The overall performance analysis for the genetic CFL architecture is discussed in Section 4.4.

4.1. DBSCAN Clustering of the Client Models

The DBSCAN algorithm, as discussed in the previous section, uses the Euclidean distance between the observations to calculate the density and clusters the observations based on this density. The models in each edge device are assigned a particular learning rate and batch size for training. These two hyperparameters serve as the two dimensions of each observation for the process of clustering. The DBSCAN algorithm takes two main parameters for clustering a set of observations: $ϵ$ and Min Samples. We note that $ϵ$ is the maximum Euclidean distance allowed between an observation and its closest point in the cluster in question. The Min Samples parameter is the minimum number of observations required to form a cluster. Thus, the tuning and selection of these parameters become essential to obtain proper and efficient results.
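To make the roles of $ϵ$ and Min Samples concrete, the sketch below is a compact pure-Python DBSCAN applied to a handful of made-up (learning rate, batch size) observations in log space. It is an illustration only: the client values are invented, the helper names are hypothetical, and a library implementation (for example scikit-learn's `DBSCAN` with its `eps` and `min_samples` parameters) would normally be used instead. As in that library, a point counts towards its own neighbourhood.

```python
import math

def region_query(points, idx, eps, dist):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, min_samples, dist):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps, dist)
        if len(neighbours) < min_samples:
            labels[i] = -1                      # provisional noise
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id          # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps, dist)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)      # j is a core point: keep expanding
    return labels

def log_dist(p, q):
    """Distance between (eta, batch) points in log10/log2 space."""
    return math.hypot(math.log10(q[0] / p[0]), math.log2(q[1] / p[1]))

# Hypothetical client hyperparameters: two tight pairs and two isolated points
clients = [(1e-3, 16), (1.1e-3, 16), (1e-3, 32), (1e-5, 64), (1.2e-5, 64), (1e-1, 128)]
labels_2 = dbscan(clients, eps=0.2, min_samples=2, dist=log_dist)  # [0, 0, -1, 1, 1, -1]
labels_1 = dbscan(clients, eps=0.2, min_samples=1, dist=log_dist)  # [0, 0, 1, 2, 2, 3]
```

With Min Samples set to 1 every isolated observation becomes its own cluster, whereas with 2 it is labelled as noise, so a cluster count that differs between the two settings signals outliers in the data.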

Table 2 summarizes the conditions tested for the quality and effectiveness of clustering with the said parameters. For each value of $ϵ$, two values of Min Samples are tested to validate the clustering effectiveness and detect outliers in the data. For the $ϵ$ values 0.200 and 0.175, the number of clusters for both 1 and 2 Min Samples stays constant at 7. This constant number of clusters for both values of Min Samples indicates that there are no outliers in the data and that the observations in each cluster are strongly related to each other. Since the number of clusters is the same for both $ϵ$ values, it is evident that the clusters are locally isolated. For the $ϵ$ values 0.150 and 0.100, the number of clusters changes drastically, indicating weak clustering among the observations. The change in the number of clusters for different Min Samples shows that there are outliers in the data, which can cause issues while performing the genetic optimization due to the lack of population. The parameters can therefore be safely assigned any of the first four combinations ($ϵ$ of 0.200 or 0.175 with Min Samples of 1 or 2) to obtain 7 distinct clusters, as shown in Figure 2. The number of observations in each cluster is plotted in Figure 3.

Table 2

DBSCAN clustering parameters and outcomes.

$ϵ$	Min samples	Number of clusters
0.200	1	7
0.200	2	7

0.175	1	7
0.175	2	7

0.150	1	8
0.150	2	7

0.100	1	15
0.100	2	18

[figure omitted; refer to PDF][figure omitted; refer to PDF]

4.2. Performance of the Genetic CFL Architecture on MNIST Dataset

In this section, we discuss the performance and analyse the training curves of the models. The server model is initially trained on a subset of the MNIST handwritten digits dataset [29]. This model is then distributed among the clients based on the client ratio. The total number of clients chosen for this experiment is 100, and the client ratios tested are 0.1, 0.15, and 0.3. In essence, we evaluate the performance of the models on 10, 15, and 30 clients, respectively. Each client device is provided with a random subset of the dataset with a random number of observations. This ensures that the data is non-IID and that the characteristics of a real-time scenario are emulated. For the initial round, the hyperparameters (learning rate and batch size) of the client devices are randomized within intervals (4) and (5), respectively. The client devices are trained for two epochs, and the hyperparameters are subjected to genetic evolution as discussed in Section 3. These rounds are tabulated in Table 3, and the best performance is plotted against each round in Figure 4.
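The client selection and data split described above can be sketched as follows. This is a hedged illustration, not the experimental code: the subset size bounds (200 to 2000), the function names, and the seed are assumptions, and the 60000 figure is the size of the standard MNIST training split.

```python
import random

def select_clients(total_clients, client_ratio, rng):
    """Pick round(total_clients * client_ratio) distinct client ids."""
    count = round(total_clients * client_ratio)
    return rng.sample(range(total_clients), count)

def random_subsets(num_samples, client_ids, min_size, max_size, rng):
    """Give each client a random-size, random-content subset of sample indices."""
    shards = {}
    for cid in client_ids:
        size = rng.randint(min_size, max_size)            # random number of observations
        shards[cid] = rng.sample(range(num_samples), size)  # random subset, no repeats
    return shards

rng = random.Random(7)
for ratio in (0.1, 0.15, 0.3):
    chosen = select_clients(100, ratio, rng)
    print(ratio, len(chosen))                 # 10, 15, and 30 clients respectively

# Assumed size bounds; the last selection (30 clients) is used here
shards = random_subsets(60000, chosen, min_size=200, max_size=2000, rng=rng)
```

Subsets drawn this way differ in size and content from client to client; a stronger label skew would need an explicit per-class sampling scheme, which the text does not specify.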

Table 3

Performance of FL against genetic CFL on MNIST dataset for various hyperparameters.

Client ratio	Rounds	FL accuracy	FL loss	Genetic CFL accuracy	Genetic CFL loss
0.1	3	0.9133	0.3136	0.9679	0.1203
	6	0.9265	0.2493	0.9730	0.1343
	10	0.9367	0.2115	0.9777	0.1923

0.15	3	0.9176	0.2878	0.9665	0.1049
	6	0.9740	0.0876	0.9740	0.0876
	10	0.9443	0.1828	0.9763	0.0910

0.3	3	0.9178	0.2989	0.9698	0.0964
	6	0.9326	0.2359	0.9780	0.0804
	10	0.9450	0.1946	0.9799	0.0849

[figure omitted; refer to PDF]

Since the model training hyperparameters are no longer predetermined, the performance of the models and their respective training are optimized locally in the cluster, thus providing a more personalized training for each cluster. The performance of the server model obtains a smooth learning curve and converges faster than the normal training of the model using FL. Table 3 represents this performance of the models for both the architectures. The superiority of performance of genetic CFL over generic FL is evident for each round. The accuracy of the genetic CFL architecture is consistently higher and the loss is consistently lower as compared to the generic FL architecture. The increase in accuracy and the decrease in loss signify that the models are indeed training and useful information is aggregated at the server.

4.3. Performance of the Genetic CFL Architecture on CIFAR-10 Dataset

This section deals with the performance and the training of the models on CIFAR-10 dataset [30] using genetic CFL architecture and its comparison with the performance of the generic FL architecture. The training process of this dataset is similar to the training of the MNIST handwritten digits’ dataset. The server initializes the model and distributes the weights of the server model to every client device; the models are trained on the random subset of the dataset assigned for two epochs; the current hyperparameters are subjected to genetic evolution; the trained weights are sent back to the server to get aggregated. This process is repeated for several rounds. The performance of the server model after each round, at the end of the aggregation phase, is plotted in Figure 5 and tabulated in Table 4.

[figure omitted; refer to PDF]

Table 4

Performance of FL against genetic CFL on CIFAR-10 dataset for various hyperparameters.

Client ratio	Rounds	FL accuracy	FL loss	Genetic CFL accuracy	Genetic CFL loss
0.10	3	0.6818	0.9540	0.6514	1.0097
	6	0.6891	1.3746	0.6639	1.3447
	10	0.6862	1.6617	0.6599	1.6449

0.15	3	0.6973	0.9612	0.7098	0.8814
	6	0.6988	1.3675	0.7199	1.693
	10	0.6952	1.6225	0.7129	1.3806

0.30	3	0.7578	0.8708	0.7688	0.7818
	6	0.7634	1.0891	0.7646	0.981
	10	0.7613	1.2964	0.7623	1.2961

For client ratios 0.15 and 0.30, the models trained on hyperparameters optimized by the genetic algorithm for the respective clusters achieve higher accuracy than those that are not, whereas generic FL remains ahead at client ratio 0.10 (Table 4). The performance also improves as the client ratio increases. The lowest loss is encountered at the second round for client ratio 0.3. The accuracy, however, peaks at the fourth epoch with a decent amount of loss for prediction. Any further training of the models does not provide better performance and causes overfitting. The training of the models is stopped at round two. The aggregated model therefore provides a significant performance boost within very few rounds. This provides speed and high throughput when deployed in a real-time system.

4.4. Performance Analysis of Genetic CFL

The genetic CFL algorithm performs better with a higher sample size. A higher number of observations per cluster should therefore improve the optimization of the hyperparameters. However, taking into consideration the diversity of datasets in both the data characteristics and the number of data points, proper clustering of similar scenarios should provide higher throughput for the models individually. This calls for a balance between the number of clusters and the size of each cluster. A proper balance can ensure that the models in the federated architecture provide the best output in the given scenario. In a real-time application, the number of edge devices expected is higher than in a synthetic environment. Following the progression of the performance, a higher number of total clients increases the performance significantly. The optimization of the hyperparameters using genetic CFL provides higher throughput in comparatively fewer rounds.

Our architecture, genetic CFL, requires fewer rounds than both algorithms [31, 32] and exceeds the accuracy of [31] on both datasets, while FedZip [32] reports a marginally higher MNIST accuracy (98.03 against 97.99). This supports the claim that the genetic CFL architecture performs competitively while taking fewer rounds. Compared with iterative clustering [16], our architecture performs better on the MNIST dataset but not on the CIFAR-10 data. This behavior is attributed to the rotation and augmentation of data, which give an upper hand in feature extraction and representation. Genetic optimization provides an elastic and adaptive framework for optimization of the hyperparameters. This flexibility gives the architecture an edge over other methods by adapting to the dataset and the required environment. Most other architectures need hyperparameter tuning beforehand and thus require manual intervention. The system then has to be reset with a different set of parameters for each different type of data and application. This rigidity can cost both time and resources. Moreover, importance is given to every single client, which affects not only the server model performance but also the performance of every single client device. A better delivery of service for each client device is ensured while increasing the performance of the server model as a whole. Table 5 compares the performance of our architecture, genetic CFL, with other architectures that incorporate clustering in federated learning. The table lists the best accuracy of the models on the MNIST handwritten digits dataset and the CIFAR-10 dataset for a given number of rounds. It is evident that the number of rounds taken is significantly lower while the accuracy remains high.

Table 5

Performance comparison.

Algorithm	Rounds	MNIST	CIFAR-10
Genetic CFL	10	97.99	76.88
Byzantine robustness of CFL [31]	200	97.4	75.3
FedZip [32]	20	98.03	—
Iterative federated clustering [16]	—	95.25	81.51

5. Conclusion

In this work, we have applied the genetic evolutionary algorithm to optimize the hyperparameters (learning rate and batch size) during the training of the individual end device models in a cluster for the FL architecture. We have identified gaps in the existing techniques and contributed the genetic CFL algorithm to address them. This architecture has been tested using the MNIST handwritten digits dataset and the CIFAR-10 dataset. Accuracies of 97.99% and 76.88%, respectively, have been achieved on these datasets. We discussed and analysed the observations and the performance of the genetic CFL architecture. We have also covered the favourable conditions and the limitations for the algorithm to provide the best performance in deployment. The overall performance of the models displays a significant rise in efficiency while reducing communication and computation cost.

As part of future work, the number of clients and the client ratio can be scaled to larger samples that closely mimic a real-time situation, owing to the high scalability of the model. As the population sample increases, the optimization of the hyperparameters becomes more efficient, thus delivering higher throughput in a real-time scenario. The type of data processed is not limited, and this architecture can be used for various scenarios such as natural language processing tasks, image classification tasks, and recommendation systems. Genetic CFL can also be integrated with time-sensitive systems to deliver better performance in very few rounds.

Disclosure

This manuscript is available as a preprint on arXiv at https://arxiv.org/abs/2107.07233. The code for this work is available in the repository at https://github.com/sagnik106/Clustered-FL-GA.

References

[1] C. Zhang, Y. Xie, H. Bai, B. Yu, W. Li, Y. Gao, "A survey on federated learning," Knowledge-Based Systems, vol. 216, DOI: 10.1016/j.knosys.2021.106775, 2021.

[2] M. Parimala, R. M. Swarna Priya, Q.-V. Pham, K. Dev, P. Kumar Reddy Maddikunta, T. R. Gadekallu, T. Huynh-The, "Fusion of federated learning and industrial internet of things: a survey," 2021. https://arxiv.org/abs/2101.00798

[3] M. Alazab, R. M. Swarna Priya, M. Parimala, T. R. Gadekallu, Q.-V. Pham, P. Reddy, "Federated learning for cybersecurity: concepts, challenges and future directions," IEEE Transactions on Industrial Informatics, DOI: 10.1109/tii.2021.3119038, 2021.

[4] W. Wang, M. H. Fida, Z. Lian, Z. Yin, Q.-V. Pham, T. R. Gadekallu, K. Dev, C. Su, "Secure-enhanced federated learning for ai-empowered electric vehicle energy prediction," IEEE Consumer Electronics Magazine, DOI: 10.1109/mce.2021.3116917, 2021.

[5] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, F. Beaufays, "Applied federated learning: Improving google keyboard query suggestions," 2018. https://arxiv.org/abs/1812.02903

[6] M. R. Sprague, A. Jalalirad, M. Scavuzzo, C. Capota, M. Neun, L. Do, M. Kopp, "Asynchronous federated learning for geospatial applications," Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 21-28.

[7] J. Xu, B. S. Glicksberg, S. Chang, P. Walker, J. Bian, F. Wang, "Federated learning for healthcare informatics," Journal of Healthcare Informatics Research, vol. 5 no. 1, DOI: 10.1007/s41666-020-00082-4, 2020.

[8] Q.-V. Pham, M. Zeng, R. Ruby, T. Huynh-The, W.-J. Hwang, "UAV communications for sustainable federated learning," IEEE Transactions on Vehicular Technology, vol. 70 no. 4, pp. 3944-3948, DOI: 10.1109/tvt.2021.3065084, 2021.

[9] A. Nilsson, S. Smith, G. Ulm, E. Gustavsson, M. Jirstrand, "A performance evaluation of federated learning algorithms," Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning, DOI: 10.1145/3286490.3286559.

[10] C. Fang, Y. Guo, Y. Hu, B. Ma, L. Feng, A. Yin, "Privacy-preserving and communication-efficient federated learning in internet of things," Computers & Security, vol. 103, DOI: 10.1016/j.cose.2021.102199, 2021.

[11] E. Ozfatura, K. Ozfatura, D. Gunduz, "Time-correlated sparsification for communication-efficient federated learning," 2021. https://arxiv.org/abs/2101.08837

[12] C. Briggs, "Federated learning with hierarchical clustering of local updates to improve training on non-iid data," 2020. https://arxiv.org/abs/2004.11791

[13] Y. Zhao, L. Meng, L. Lai, N. Suda, D. Civin, V. Chandra, "Federated learning with non-iid data," 2018. https://arxiv.org/abs/1806.00582

[14] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, Y. Khazaeni, "Federated learning with matched averaging," 2020. https://arxiv.org/abs/2002.06440

[15] X. Yao, T. Huang, C. Wu, R. Zhang, L. Sun, "Towards faster and better federated learning: a feature fusion approach," pp. 175-179, DOI: 10.1109/icip.2019.8803001.

[16] A. Ghosh, J. Chung, Y. Dong, R. Kannan, "An efficient framework for clustered federated learning," 2020. https://arxiv.org/abs/2006.04088

[17] K. Kopparapu, E. Lin, J. Zhao, "Fedcd: Improving performance in non-iid federated learning," 2020. https://arxiv.org/abs/2006.09637

[18] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, J. Roselander, "Towards federated learning at scale: system design," 2019. https://arxiv.org/abs/1902.01046

[19] M. Xie, G. Long, T. Shen, T. Zhou, X. Wang, J. Jiang, C. Zhang, "Multi-center federated learning," 2020. https://arxiv.org/abs/2005.01026

[20] Z. Chai, A. Ali, S. Zawad, S. Truex, A. Anwar, N. Baracaldo, Y. Zhou, H. Ludwig, F. Yan, Y. Cheng, "TiFL: a tier-based federated learning system," Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 125-136.

[21] A. Likas, N. Vlassis, J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36 no. 2, pp. 451-461, DOI: 10.1016/s0031-3203(02)00060-2, 2003.

[22] C. Briggs, Z. Fan, P. Andras, "Federated learning with hierarchical clustering of local updates to improve training on non-iid data," DOI: 10.1109/ijcnn48605.2020.9207469.

[23] D. Birant, A. Kut, "ST-DBSCAN: an algorithm for clustering spatial-temporal data," Data & Knowledge Engineering, vol. 60 no. 1, pp. 208-221, DOI: 10.1016/j.datak.2006.01.013, 2007.

[24] J.-Y. Kim, S.-B. Cho, "Evolutionary optimization of hyperparameters in deep learning models," pp. 831-837, DOI: 10.1109/cec.2019.8790354.

[25] I. Aljarah, H. Faris, S. Mirjalili, "Optimizing connection weights in neural networks using the whale optimization algorithm," Soft Computing, vol. 22 no. 1, DOI: 10.1007/s00500-016-2442-1, 2018.

[26] X. Xiao, M. Yan, S. Basodi, C. Ji, Y. Pan, "Efficient hyperparameter optimization in deep learning using a variable length genetic algorithm," 2020. https://arxiv.org/abs/2006.12703

[27] G. Beruvides, R. Quiza, M. Rivas, F. Castaño, R. E. Haber, "Online detection of run out in microdrilling of tungsten and titanium alloys," International Journal of Advanced Manufacturing Technology, vol. 74 no. 9–12, pp. 1567-1575, 2014.

[28] G. Beruvides, C. Juanes, F. Castaño, R. E. Haber, "A self-learning strategy for artificial cognitive control systems," pp. 1180-1185, DOI: 10.1109/indin.2015.7281903.

[29] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29 no. 6, pp. 141-142, DOI: 10.1109/msp.2012.2211477, 2012.

[30] A. Krizhevsky, G. Hinton, "Learning multiple layers of features from tiny images," 2009.

[31] F. Sattler, K.-R. Müller, T. Wiegand, W. Samek, "On the byzantine robustness of clustered federated learning," pp. 8861-8865, DOI: 10.1109/icassp40776.2020.9054676.

[32] A. Malekijoo, M. J. Fadaeieslam, H. Malekijou, M. Homayounfar, F. Alizadeh-Shabdiz, R. Rawassizadeh, "Fedzip: a compression framework for communication-efficient federated learning," 2021. https://arxiv.org/abs/2102.01593

Copyright © 2021 Shaashwat Agrawal et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/

Abstract

Federated learning (FL) is a distributed model for deep learning that integrates client-server architecture, edge computing, and real-time intelligence. FL has the potential to revolutionize machine learning (ML), but it remains impractical to implement because of technological limitations, communication overhead, non-IID (non-independent and identically distributed) data, and privacy concerns. Training an ML model over heterogeneous non-IID data significantly degrades the convergence rate and performance. Existing traditional and clustered FL algorithms exhibit two main limitations: inefficient client training and static hyperparameter utilization. To overcome these limitations, we propose a novel hybrid algorithm, genetic clustered FL (Genetic CFL), that clusters edge devices based on their training hyperparameters and genetically modifies the parameters cluster by cluster. We then introduce an algorithm that drastically increases individual cluster accuracy by integrating density-based clustering with genetic hyperparameter optimization. The results are benchmarked on the MNIST handwritten digit dataset and the CIFAR-10 dataset. The proposed Genetic CFL shows significant improvements and works well in realistic cases of non-IID and ambiguous data. An accuracy of 99.79% is observed on the MNIST dataset and 76.88% on the CIFAR-10 dataset with only 10 training rounds.
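
The two ingredients named in the abstract, density-based clustering of clients by their hyperparameters and a genetic update applied within each cluster, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`density_clusters`, `evolve`, `toy_fitness`), the encoding of each client as a `(log10 learning rate, log2 batch size)` pair, the `eps` and `min_pts` values, and the toy fitness that stands in for validation accuracy are all assumptions made for illustration.

```python
import random


def density_clusters(points, eps, min_pts):
    """DBSCAN-style clustering; returns one label per point (-1 means noise)."""
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned, do not expand again
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:   # core point: keep growing the cluster
                queue.extend(nb)
    return labels


def evolve(population, fitness, rng, mutation_scale=0.1):
    """One generation: keep the fitter half, refill with mutated crossovers."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:max(2, len(ranked) // 2)]
    children = []
    while len(parents) + len(children) < len(population):
        a, b = rng.sample(parents, 2)
        # Uniform crossover per gene, then a small multiplicative mutation.
        child = tuple(
            rng.choice(pair) * (1 + rng.uniform(-mutation_scale, mutation_scale))
            for pair in zip(a, b)
        )
        children.append(child)
    return parents + children


def toy_fitness(h):
    """Hypothetical stand-in for validation accuracy, peaking at (-3, 5)."""
    target = (-3.0, 5.0)
    return -sum((a - b) ** 2 for a, b in zip(h, target))


if __name__ == "__main__":
    rng = random.Random(0)
    # Each client: (log10 learning rate, log2 batch size); values are made up.
    clients = [(-3.0, 5.0), (-3.1, 5.0), (-2.9, 5.1),
               (-1.0, 7.0), (-1.1, 7.0), (-0.9, 6.9), (-5.0, 3.0)]
    labels = density_clusters(clients, eps=0.5, min_pts=2)
    for c in sorted(set(labels) - {-1}):
        members = [h for h, lab in zip(clients, labels) if lab == c]
        members = evolve(members, toy_fitness, rng)
        print(c, max(members, key=toy_fitness))
```

In a real FL round the fitness would come from each client's locally evaluated model rather than a closed-form function, and the evolved hyperparameters would be sent back to the clients of that cluster for the next round of local training.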

Details

Title

Genetic CFL: Hyperparameter Optimization in Clustered Federated Learning

Author

Agrawal, Shaashwat¹; Sarkar, Sagnik¹; Alazab, Mamoun²; Praveen Kumar Reddy Maddikunta³; Thippa Reddy Gadekallu³; Quoc-Viet Pham⁴

¹ School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India
² College of Engineering, IT and Environment, Charles Darwin University, Casuarina 0909, NT, Australia
³ School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India
⁴ Korean Southeast Center for the 4th Industrial Revolution Leader Education, Pusan National University, Busan 46241, Republic of Korea

Editor

Rodolfo E Haber

Publication year

2021

Publication date

2021

Publisher

John Wiley & Sons, Inc.

ISSN

16875265

e-ISSN

16875273

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2021/7156420

ProQuest document ID

2603590716