Hierarchical distributed edge data aggregation and reporting method based on cluster center selection

Abstract

The surge of IoT devices, sensors, and smart terminals has led to distributed data sources and vast volumes of data. This challenges traditional centralized networks and cloud computing architectures, which struggle with bandwidth, latency, and storage limitations. Consequently, decentralized edge computing is crucial, enabling data processing and analysis at the network's edge to alleviate data return pressure and enhance system response speed and reliability. However, traditional centralized data aggregation methods become inefficient in the face of massive data and computing resources, resulting in long transmission times and low processing efficiency. To address these issues, this paper presents a hierarchical distributed edge data aggregation reporting method based on cluster center selection (HDAR-CCS). This method employs a staged approach to distributed data aggregation, utilizing parallel processing at each stage to efficiently handle data from multiple edge data centers. Additionally, an optimal cluster center selection algorithm is proposed, integrating the distances between cluster centers and available network resources. By establishing a selection criterion based on these distances, we design an effective scheme for choosing initial and subsequent cluster centers. Experimental results demonstrate that our approach outperforms existing algorithms, effectively meeting the low latency, high bandwidth, and efficient processing needs of intelligent applications.

Introduction

With the rapid development of artificial intelligence, IoT, 5G, and other technologies, traditional centralized networks and cloud computing architectures face many challenges in handling the communication and computing needs of massive data. This is mainly reflected in the following aspects. First, transferring all the data to the cloud incurs significant delays and risks data loss. Second, centralized cloud computing may lead to bandwidth bottlenecks and can easily cause congestion when processing a large number of concurrent requests. In other words, the communication and computation of massive data make it difficult for networks and cloud computing to meet the high bandwidth and low latency requirements of emerging technologies and industries. In addition, in the military field, intelligent warfare and the complex battlefield environment lead to frequent data interruptions, which make uninterrupted data transmission difficult and weaken edge data aggregation and reporting capabilities. The traditional big data processing model based on cloud computing can therefore no longer meet end users' requirements for high efficiency and low energy consumption in edge data processing. As a result, the problems and challenges faced by both civilian and military fields are driving the expansion of computing power from cloud computing to edge networks [1].

Edge networking refers to an architecture that distributes computing and data processing capabilities to the edge of the network, improving the speed and efficiency of data processing. Compared to traditional centralized cloud computing, edge networks deploy resources closer to users and data sources or in areas with better signals, such as on smart devices, routers, or edge servers [2, 3]. Computing and processing data locally on the edge network can significantly reduce data transmission time and decrease the amount of data sent to the cloud, thereby optimizing bandwidth resources and reducing network congestion. In the edge network, data is stored across multiple edge devices (nodes). Typically, data from the edge nodes must first be aggregated at the edge data center (EDC) before being uploaded to the cloud. Consequently, one of the current challenges is how to efficiently aggregate the data stored in edge nodes at edge data centers, minimize the data aggregation time, and better meet the requirements of intelligent applications for low latency, high bandwidth, and efficient processing.

As for how edge nodes aggregate data to edge data centers, existing solutions can be categorized into two types: centralized data aggregation solutions and distributed data aggregation solutions [4], as illustrated in Fig. 1. Due to the limited processing capacity of a single EDC and the increasing demand for data-intensive services, centralized data aggregation solutions have become inefficient or unfeasible [5]. Therefore, a distributed data aggregation solution is necessary to process data at the edge by routing a large amount of data to multiple EDCs. However, selecting a distributed data aggregation solution presents some challenges. On one hand, if the chosen EDC is located far from certain edge nodes that store the required data, the transmission delay will increase. On the other hand, if the EDC stores a substantial amount of data, network resources between the edge centers may become constrained. In such cases, transmitting data from multiple EDCs to an edge aggregation center can result in intense competition for network resources near the edge data aggregation center (cluster center), ultimately affecting the stability of data transmission. Thus, determining how to effectively select the EDC to serve as the cluster center and successfully complete edge-to-cloud data aggregation and reporting remains a significant challenge.

[See PDF for image]

Fig. 1

Data aggregation solutions

In order to address the two challenges mentioned above, we propose a new solution: a hierarchical distributed data edge aggregation and reporting method based on cluster center selection. The method structure diagram is shown in Fig. 2. The contributions of this paper are as follows:

Construction of a hierarchical distributed edge data aggregation and reporting model: To more efficiently handle the issues of big data transmission and aggregation across multiple EDCs in complex environments, we propose a hierarchical distributed edge data aggregation and reporting method. By combining computing and network resources, this method minimizes data aggregation time while ensuring low bandwidth consumption. Specifically, a distributed data aggregation solution is adopted, dividing the aggregation process into multiple stages, with a parallel mode used at each stage to process the distributed data.
Cluster center optimal selection algorithm: To more effectively select edge data centers, we propose an optimal cluster center selection algorithm that integrates the distance between cluster centers and available network resources. By formulating a judgment metric as the selection criterion for the initial cluster center, we design a selection scheme for the remaining cluster centers based on the distances between them. Finally, we develop a cluster partitioning method to divide the cluster members for each cluster center.

[See PDF for image]

Fig. 2

Hierarchical distributed edge data aggregation reporting method based on cluster center selection structure diagram

The rest of this paper is organized as follows. In "Related work", we introduce related works on current data aggregation methods. In "Hierarchical distributed edge data aggregation and reporting construction", we elaborate on the established hierarchical distributed edge data aggregation and reporting mathematical model in detail, demonstrating the superiority of our proposed scheme through mathematical reasoning. In "Cluster center optimal selection algorithm", we propose an optimal cluster center selection algorithm to address the issues of cluster center selection and partitioning in EDC. In "Experimental results and analysis", we present experimental results that compare the performance of the proposed method. Finally, "Conclusion" concludes the paper.

Related work

In the previous section, we discussed how edge data centers can aggregate data to cluster centers. Existing solutions can be categorized into two types: centralized data aggregation solution and distributed data aggregation solution. The following section will provide a detailed explanation of the current research status of both types of solutions.

Centralized data aggregation solution: As shown in Fig. 1a, the centralized data aggregation algorithm selects only one cluster center, which collects and stores information from all nodes in the network. If any node wants to communicate with another node, it must request the cluster center. Reference 1 proposes a cluster-based IoT cognitive data aggregation model to address the weaknesses of energy-constrained devices. This model employs reasoning and learning elements to achieve cognition, improve energy consumption, and ensure service quality. Reference 2 proposes a multi-job task allocation strategy to achieve maximum and minimum fairness in job completion time among jobs sharing these data centers. Reference 3 presents a scalable and load-balancing scheme for data aggregation based on mobile agents, optimizing performance in terms of energy consumption, response time, and network lifespan during data aggregation. Additionally, several other centralized data aggregation algorithms have proposed related improved algorithms concerning network [6, 7], mobile agent routing [8], protocol [9], energy consumption [10], and more.

Distributed data aggregation solution: As shown in Fig. 1b, the distributed data aggregation algorithm can select multiple data aggregation centers. The remaining nodes choose the data aggregation center at which to aggregate their stored or processed data based on indicators such as geographical location [11], latency [12–14], bandwidth consumption [15], and energy consumption [16].

The above algorithms are more suitable for network scenarios with small networks and data volumes because the storage and computing capabilities of a single cluster center are limited. As data volumes continue to increase, EDC will inevitably face challenges in bandwidth and time when transmitting and aggregating large amounts of data. This indicates that centralized data aggregation methods cannot effectively handle the ever-growing data. Consequently, distributed data aggregation solutions have emerged.

Li et al. [17] design a hierarchical data aggregation architecture (LDA-EPP) based on fog computing, where fog nodes and the cloud are responsible for local and global aggregation, respectively, effectively reducing the amount of data transmitted. Pallavi et al. [18] propose a data aggregation scheduling algorithm (DEDA) for aggregating data in tree-based distributed sensor networks. An effective aggregation tree structure based on the Dijkstra algorithm is employed to minimize the possibility of data retransmission. Liu et al. [19] introduce a multi-stage geo-distributed data aggregation algorithm (MGDD-CC) that combines computing and communication resources to collect geo-distributed data stored in multiple edge data centers into a single edge data center, thereby reducing response delays and conserving bandwidth resources. Additionally, other distributed data aggregation algorithms include a privacy-preserving in-network aggregation algorithm [20], efficient distributed algorithms for holistic aggregation functions on random regular graphs [21], a secret confusion-based energy-saving and privacy-preserving data aggregation algorithm [22], and lightweight and verifiable secure aggregation for multi-dimensional data [23], among others.

Most of the algorithms mentioned above focus on only one aspect when planning the data aggregation scheme and lack a comprehensive approach for selecting cluster centers and dividing group members. Moreover, these methods are not well-suited for the edge network environment studied in this paper. Therefore, we propose a hierarchical distributed data edge aggregation and reporting method based on cluster center selection for large-scale data in edge network environments. The proposed solution addresses three key issues: (1) How to construct and optimize the edge data aggregation model; (2) How to select the initial cluster center and subsequent cluster centers; (3) How to divide the members of each cluster based on the cluster center. "Hierarchical distributed edge data aggregation and reporting construction" will detail the construction and optimization process for Problem 1, while Problems 2 and 3 will be addressed in "Cluster center optimal selection algorithm".

Hierarchical distributed edge data aggregation and reporting construction

Problem description: Since data is stored in multiple EDCs, an EDC that stores the required data but is located far from the cluster center will increase transmission delays. Furthermore, if the amount of data stored in the EDC is large and network resources between the edge data centers are limited, transmitting data from multiple EDCs to a cluster center can lead to significant competition for network resources near the cluster center, negatively impacting the stability of data transmission. Therefore, to more efficiently address the challenges of big data transmission and aggregation from multiple EDCs in complex environments, a hierarchical distributed edge data aggregation and reporting method is proposed. This method minimizes data aggregation time while ensuring low bandwidth consumption by combining computing and network resources. Specifically, a distributed data aggregation scheme is implemented, dividing the data aggregation process into multiple stages, with data from various locations processed in parallel at each stage.

Problem definition: Based on the problem description above, a hierarchical distributed data edge aggregation and reporting model is constructed, as shown in Fig. 3. We define each stage to include multiple clusters. Within each cluster, data is stored in multiple edge data centers. Each cluster must select an EDC as its current cluster center, and the data from the edge data centers must be transferred to this cluster center. Additionally, the time required for the current cluster to complete data aggregation is defined as CT, the time needed to complete the task in the current stage is defined as ST, and the total time to finish the work is defined as TT. TT is equal to the sum of the completion times of multiple stages. Therefore, to reduce TT, efforts must be made to shorten ST.

[See PDF for image]

Fig. 3

Hierarchical distributed edge data aggregation and reporting model

Current stage task completion time model

Problem description: As shown in Fig. 4, an example of the completion time for the current stage is provided.

[See PDF for image]

Fig. 4

Example of completion time for the current phase

At stage s, we assume that the data required by the job is stored in seven edge data centers. These edge data centers are divided into two clusters (Cluster 1 and Cluster 2), with the cluster centers (CC) of Cluster 1 and Cluster 2 denoted as ${EDC}_{3} ({CC}_{1})$ and ${EDC}_{4} ({CC}_{2})$, respectively. Each cluster's completion time (CT) comprises communication time (ct) and data processing time (pt). In cluster c, the communication time between the EDC and the CC is recorded as ${ct}_{c, (EDC, CC)}^{s}$; for instance, ${ct}_{1, (1, 3)}^{s}, {ct}_{1, (2, 3)}^{s}, {ct}_{2, (1, 4)}^{s}, {ct}_{2, (2, 4)}^{s}, {ct}_{2, (3, 4)}^{s}$ in the figure. The data processing time of the current cluster is recorded as ${pt}_{c}^{s}$, such as ${pt}_{1}^{s}$ and ${pt}_{2}^{s}$ in the figure. As illustrated in the figure, at stage s, the time taken for Cluster 1 and Cluster 2 to complete data aggregation is ${CT}_{1}^{s}$ and ${CT}_{2}^{s}$, respectively, and the task completion time for the current stage, denoted as ST, depends on the completion time of the bottleneck cluster.

Problem definition: According to the above problem description, ST depends on the completion time of the bottleneck cluster. We define the task completion time of the current stage ST as:

{ST}^{s} = \max_{c} {CT}_{c}^{s} \tag{1}

CT includes communication time (ct) and data processing time (pt), so CT is defined as:

{CT}_{c}^{s} = {ct}_{c}^{s} + {pt}_{c}^{s} \tag{2}

Communication time ct model:

In computer networks, data communication time consists of send time, transmission time, node processing time, and queuing time. The send time (st) is the duration required for a node to send a data frame. It is proportional to the node's data volume (DV) and inversely proportional to the network's available bandwidth (AB). The definition of send time is presented in Formula (3). Transmission time (tt) is the time required for data to be transmitted from one node to another, affected by the link distance ($d_{EDC, {EDC}^{'}}$) between the two nodes. The transmission time is defined as shown in Formula (4). Node processing time and queuing time refer to the duration it takes for a host, router, or switch to receive a data packet and process the packet queue. This time is typically on the order of microseconds or less, which is significantly shorter than the send time and transmission time. Therefore, our primary focus is on the send time and transmission time.

{st}_{EDC, {EDC}^{'}}^{s} = \frac{{DV}_{EDC}^{s}}{{AB}_{EDC, {EDC}^{'}}^{s}} \tag{3}

{tt}_{EDC, {EDC}^{'}}^{s} = \frac{d_{EDC, {EDC}^{'}}}{v} \tag{4}

where $v$ is the transmission rate of the data packet in the link. In summary, the communication time between $EDC$ and ${EDC}^{'}$ is:

{ct}_{EDC, {EDC}^{'}}^{s} = {st}_{EDC, {EDC}^{'}}^{s} + {tt}_{EDC, {EDC}^{'}}^{s} = \frac{{DV}_{EDC}^{s}}{{AB}_{EDC, {EDC}^{'}}^{s}} + \frac{d_{EDC, {EDC}^{'}}}{v} \tag{5}

Data processing time pt model

The data processing time on the EDC is proportional to the amount of node data and inversely proportional to the available computing resources of the EDC. The data processing time in the EDC at stage s is defined as follows:

{pt}_{c}^{s} = \frac{{DV}_{c}^{s}}{{AR}_{c}^{s}} \tag{6}

Therefore, according to the above formula, ST is finally transformed into:

{ST}^{s} = \max_{c} ({ct}_{c}^{s} + {pt}_{c}^{s}) = \max_{c} (\frac{{DV}_{c}^{s}}{{AB}_{c}^{s}} + \frac{d_{c, c^{'}}}{v} + \frac{{DV}_{c}^{s}}{{AR}_{c}^{s}}) \tag{7}
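The stage-time model above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Member` class, the function names, and the sample numbers are hypothetical, and two assumptions are made that the text does not state explicitly, namely that member EDCs send in parallel (so the slowest transfer sets the cluster's communication time) and that the cluster center processes the combined data volume of its members.

```python
from dataclasses import dataclass


@dataclass
class Member:
    data_volume: float  # DV held by the member EDC
    bandwidth: float    # AB on the path to the cluster center
    distance: float     # d, link distance to the cluster center


def communication_time(member, v):
    """ct = st + tt = DV / AB + d / v for one member EDC."""
    return member.data_volume / member.bandwidth + member.distance / v


def cluster_completion_time(members, available_compute, v):
    """CT = ct + pt for one cluster."""
    # Assumption: members send in parallel, so the slowest transfer sets ct.
    ct = max(communication_time(m, v) for m in members)
    # Assumption: the cluster center processes the members' combined volume.
    pt = sum(m.data_volume for m in members) / available_compute
    return ct + pt


def stage_completion_time(clusters, v):
    """ST = max over clusters of CT; clusters is a list of (members, AR) pairs."""
    return max(cluster_completion_time(ms, ar, v) for ms, ar in clusters)


# Hypothetical stage with two clusters.
cluster_1 = ([Member(100, 50, 200), Member(60, 20, 100)], 40)
cluster_2 = ([Member(80, 40, 100)], 20)
st = stage_completion_time([cluster_1, cluster_2], v=100)
```

With these sample values Cluster 1 is the bottleneck cluster, so it alone determines the stage completion time.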

Job completion time model

Problem description: According to the above discussion, ST depends on the completion time of the bottleneck cluster, and TT is equal to the sum of the completion times of multiple stages, that is, the job (task) completion time TT model is

TT = \sum_{s = 1}^{S} {ST}^{s} \tag{8}

At this time, TT is entirely dependent on the completion time of the bottleneck cluster. If the completion time of the bottleneck cluster at the current stage is significantly longer than the aggregation time of the other clusters, those clusters must wait for the bottleneck cluster to finish before they can collectively enter the next stage of data aggregation. This delay will not only prolong the TT time but also seriously hinder the efficiency of data aggregation and reporting.

Problem definition: According to the problem description, to further reduce TT and improve the efficiency of edge data reporting, we propose not using the completion time of the bottleneck cluster as the standard for data aggregation to enter the next stage. Instead, our data aggregation solution allows us to proceed to the next stage immediately when only the bottleneck cluster has not completed data aggregation in the current stage, without waiting for it to finish. The following section demonstrates the effectiveness of our proposed aggregation solution.

Model of work completion time before optimization

As shown in Fig. 5a, an example of using the pre-optimization solution to aggregate data and report the completion time of a task is presented. This task consists of three stages, divided into 4, 2, and 1 clusters, respectively. The data aggregation completion time for each cluster is provided. In this case, the total completion time for the task is equal to the sum of the completion times of the bottleneck clusters in each stage, that is,

{TT}_{pre} = \sum_{s = 1}^{S} {ST}^{s} = \sum_{s = 1}^{S} \max_{c} {CT}_{c}^{s} \tag{9}

[See PDF for image]

Fig. 5

a Example of the completion time of the work plan before optimization. b Example of the completion time of the optimized work plan

Optimized solution work completion time model

As shown in Fig. 5b, an example of the completion time for data aggregation and reporting using the optimized solution is presented. Unlike the pre-optimization scheme, the time model of the post-optimization scheme no longer exclusively depends on the completion time of the bottleneck cluster. The primary concept is that when only the bottleneck cluster has not completed data aggregation in the current stage, the process does not wait for it to finish. Instead, it immediately proceeds to the next stage. At this point, the remaining time for the bottleneck cluster to complete its work in the current stage is $Δ t^{s}$. In summary, the total completion time after optimization ${TT}_{opt}$ is

{TT}_{opt} = \sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + {ST}^{S} + Δ t^{S - 1} \tag{10}

where $\operatorname{secondmax}({CT}^{s})$ is the second-largest cluster completion time in stage $s$, that is, the largest completion time among the clusters other than the bottleneck cluster, and ${ST}^{S}$ is the completion time of the final stage.

From Formulas (9) and (10), we can see that compared with ${TT}_{pre}$ , the total time to complete the work of the optimized solution ${TT}_{opt}$ no longer depends only on the bottleneck cluster. The size of ${TT}_{opt}$ depends on $Δ t^{S - 1}$ , and the value range of $Δ t^{S - 1}$ needs to consider the relationship between the cluster centers of the previous and next stages, that is, whether the cluster center of the current stage is the cluster center of the next stage. Therefore, the following compares the completion time of the two solutions ( ${TT}_{pre}$ and ${TT}_{opt}$ ) by analyzing the value range of $Δ t^{S - 1}$ .

Comparison of completion time of the two solutions

Situation 1: As shown in Fig. 6a, when the cluster centers of the current stage are identical to those of the next stage, the unfinished work of the bottleneck cluster in the current stage and the data aggregation work in the next stage do not affect each other and can be conducted simultaneously. Thus, in this scenario, there is no need to consider the impact of the remaining time for the bottleneck cluster to complete the work in the current stage on the time required to complete the work in the next stage, that is, $Δ t^{s} = 0$. At this point, ${TT}_{opt}$ attains its minimum value ${TT}_{opt}^{\min}$. According to Formula (10), ${TT}_{opt}^{\min}$ is

{TT}_{opt}^{\min} = \sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + {ST}^{S}

[See PDF for image]

Fig. 6

a Situation 1. b Situation 2

Comparison between ${TT}_{opt}^{\min}$ and ${TT}_{pre}$ :

\begin{aligned} {TT}_{pre} &= \sum_{s = 1}^{S} \max_{c} {CT}_{c}^{s} \\ &= \sum_{s = 1}^{S - 1} (\operatorname{secondmax}({CT}^{s}) + Δ t^{s}) + {ST}^{S} \\ &= \sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + {ST}^{S} + \sum_{s = 1}^{S - 1} Δ t^{s} \\ &= {TT}_{opt}^{\min} + \sum_{s = 1}^{S - 1} Δ t^{s} \end{aligned}

Since $\sum_{s = 1}^{S - 1} Δ t^{s} > 0$, in this situation ${TT}_{opt}^{\min} < {TT}_{pre}$.
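As a quick numerical check of this relation, consider a three-stage job with 4, 2, and 1 clusters. The completion times below are hypothetical values chosen for illustration, not measurements from the paper:

```latex
\begin{aligned}
{CT}^{1} &= \{4, 6, 5, 9\}: \quad {ST}^{1} = 9, \quad \operatorname{secondmax}({CT}^{1}) = 6, \quad \Delta t^{1} = 3 \\
{CT}^{2} &= \{7, 10\}: \quad {ST}^{2} = 10, \quad \operatorname{secondmax}({CT}^{2}) = 7, \quad \Delta t^{2} = 3 \\
{CT}^{3} &= \{5\}: \quad {ST}^{3} = 5 \\
{TT}_{pre} &= 9 + 10 + 5 = 24 \\
{TT}_{opt}^{\min} &= 6 + 7 + 5 = 18 \\
{TT}_{pre} - {TT}_{opt}^{\min} &= \Delta t^{1} + \Delta t^{2} = 6
\end{aligned}
```

Here each $\Delta t^{s}$ is taken as the gap between the bottleneck cluster and the second-slowest cluster of stage $s$, so the saving in the best case equals the sum of those gaps.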

Situation 2: As shown in Fig. 6b, when the cluster centers of the bottleneck cluster from the previous stage are all located within the bottleneck cluster in the current stage, the completion time of the current stage's bottleneck cluster is the moment when the EDC converges to the cluster center. Therefore, the work completion time of the cluster ${CT}_{c}^{s'}$ is:

{CT}_{c}^{s'} = {ct}_{c}^{s} + {pt}_{c}^{s} + Δ t^{s} = {CT}_{c}^{s} + Δ t^{s}

Here ${CT}_{c}^{s} + Δ t^{s} = {ST}^{s}$, that is, ${CT}_{c}^{s'} = {ST}^{s}$. In this case, ${TT}_{opt}$ attains its maximum value ${TT}_{opt}^{\max}$. Combined with Formula (10), we have

\begin{aligned} {TT}_{opt}^{\max} &= \sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + {ST}^{S} + Δ t^{S - 1} \\ &= (\sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + Δ t^{S - 1}) + {ST}^{S} \\ &= \sum_{s = 1}^{S - 1} {CT}_{c}^{s'} + {ST}^{S} \\ &= \sum_{s = 1}^{S - 1} {ST}^{s} + {ST}^{S} = \sum_{s = 1}^{S} {ST}^{s} = {TT}_{pre} \end{aligned}

Therefore, in this situation, ${TT}_{opt}^{\max} = {TT}_{pre}$.

Situation 3: When the cluster center of the current stage may also serve as the cluster center of the next stage, it is important to consider the completion time of this stage in conjunction with the two extreme cases mentioned above. According to the description, the work completion time of the current situation ${TT}_{opt}^{mid}$ falls between the work completion times of situation 1 and situation 2, which means that: ${TT}_{opt}^{\min} < {TT}_{opt}^{mid} < {TT}_{opt}^{\max}$

In summary, through analysis, we can conclude that ${TT}_{opt} \leq {TT}_{pre}$

That is, the optimized solution work completion time model we proposed can consume less time to complete the same work. Finally, the optimized solution work completion time model TT is:

TT = \begin{cases} {ST}^{1}, & S = 1 \\ \sum_{s = 1}^{S - 1} \operatorname{secondmax}({CT}^{s}) + {ST}^{S} + Δ t^{S - 1}, & S \geq 2 \end{cases}

Calculate $Δ t^{S - 1}$

As shown in Fig. 6b, $Δ t^{S - 1}$ is a value that is accumulated from the second stage to the $(S - 1)$th stage, so we can define $Δ t^{S - 1}$ as:

Δ t^{S - 1} = \sum_{s = 1}^{S - 1} Δ t^{s}

where $Δ t^{s}$ is the difference between ${ST}^{s}$ and $\operatorname{secondmax}({CT}^{s})$ in stage $s$, so we need to determine ${ST}^{s}$ and $\operatorname{secondmax}({CT}^{s})$ for each stage. Based on the above analysis, we propose an algorithm for calculating $Δ t^{S - 1}$. The basic idea is to calculate $Δ t^{s}$ for each stage by discussing the different situations, and to accumulate these values continuously. The final accumulated value is the desired $Δ t^{S - 1}$. The pseudo code of the algorithm for calculating $Δ t^{S - 1}$ is shown in Table 1:

Table 1. Calculate $Δ t^{S - 1}$ algorithm

[See PDF for image]
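Because the published pseudo code is only available as an image, the following Python sketch is not a transcription of Table 1. It is one possible reading of the accumulation described in the text, under the simplifying assumption that each stage is classified with a boolean flag: `True` when the bottleneck cluster's unfinished work delays the next stage (Situation 2) and `False` when it overlaps with it (Situation 1). The function names and the `carries_over` parameter are hypothetical.

```python
def second_max(values):
    """Second-largest completion time; a single-cluster stage returns its only value."""
    ordered = sorted(values, reverse=True)
    return ordered[1] if len(ordered) > 1 else ordered[0]


def accumulated_delta_t(stage_cts, carries_over):
    """Accumulate delta t over stages 1..S-1.

    stage_cts:    list of per-stage lists of cluster completion times CT.
    carries_over: one flag per non-final stage (assumption, see lead-in).
    """
    total = 0.0
    for cts, carries in zip(stage_cts[:-1], carries_over):
        if carries:
            # delta t of a stage = ST - secondmax(CT) for that stage.
            total += max(cts) - second_max(cts)
    return total


def optimized_total_time(stage_cts, carries_over):
    """TT for the optimized scheme: ST^1 if S = 1, else the three-term sum."""
    if len(stage_cts) == 1:
        return max(stage_cts[0])
    head = sum(second_max(cts) for cts in stage_cts[:-1])
    return head + max(stage_cts[-1]) + accumulated_delta_t(stage_cts, carries_over)


# Hypothetical three-stage job with 4, 2, and 1 clusters.
stages = [[4, 6, 5, 9], [7, 10], [5]]
best = optimized_total_time(stages, [False, False])   # Situation 1 everywhere
worst = optimized_total_time(stages, [True, True])    # Situation 2 everywhere
mixed = optimized_total_time(stages, [True, False])   # in between
```

In this sketch the all-`True` case reproduces the pre-optimization total and the all-`False` case gives the lower bound, which matches the ordering derived in the comparison of the two solutions.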

Cluster center optimal selection algorithm

Problem description: The location of the cluster center impacts data communication time. If the cluster centers are too concentrated, intense competition for network resources arises in their vicinity, thereby increasing transmission delays. Additionally, if the selected cluster center has insufficient network resources, it will negatively affect data processing time. To address these issues, we propose an optimal selection algorithm for cluster centers that considers both the distance between cluster centers and the availability of network resources. The steps of the algorithm are as follows:

Step 1: Select the initial cluster centers

Problem description: For data clustering problems, the selection of the initial cluster center significantly impacts the results of data aggregation. If the initial value is not chosen effectively, the clustering results may be suboptimal. Therefore, considering the effects of latency and bandwidth on cluster division, a selection scheme for the initial cluster center is proposed.

Problem definition: According to the problem description above, we consider the shortest path between nodes ( $P_{\min}$ ) and the available bandwidth in the network resources ( $AB$ ) as the judgment metric to determine the selection of the initial cluster center at this stage. The judgment metric $m_{i}^{s}$ is defined as follows:

m_{i}^{s} = W{\bar{P}}_{\min, i} \cdot {\bar{AB}}_{\max, i}

Here $W{\bar{P}}_{\min, i}$ represents the weight value of the average shortest path distance between ${EDC}_{i}$ and other $EDCs$. ${\bar{AB}}_{\max, i}$ represents the maximum average available bandwidth between ${EDC}_{i}$ and other $EDCs$, where the maximum available bandwidth is determined by the link with the minimum available bandwidth in the path. We define that the smaller the contribution of the average shortest path distance (${\bar{P}}_{\min, i}$) between ${EDC}_{i}$ and the other EDCs to the total distance within the cluster, the more likely it is to become the initial cluster center, and thus the higher the weight it receives. $W{\bar{P}}_{\min, i}$ is defined as follows:

First, define two constraints for $W {\bar{P}}_{m i n, i}$ :

Constraint 1: $0 < W{\bar{P}}_{\min, i} \leq 1$

Constraint 2: $\sum_{i = 1}^{n} W{\bar{P}}_{\min, i} = 1$

According to the constraints, $W{\bar{P}}_{\min, i}$ is defined as:

W{\bar{P}}_{\min, i} = \frac{\sum_{j = 1}^{n} {\bar{P}}_{\min, j} - {\bar{P}}_{\min, i}}{(n - 1) \sum_{j = 1}^{n} {\bar{P}}_{\min, j}}

According to Formula (17), the final judgment metric $m_{i}^{s}$ is:

m_{i}^{s} = \frac{\sum_{j = 1}^{n} {\bar{P}}_{\min, j} - {\bar{P}}_{\min, i}}{(n - 1) \sum_{j = 1}^{n} {\bar{P}}_{\min, j}} \cdot {\bar{AB}}_{\max, i}

According to Formula (18), the ${EDC}_{i}$ with the largest $m_{i}^{s}$ is selected as the initial cluster center of the current stage s.
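The selection of the initial cluster center can be illustrated with a short Python sketch. It assumes that the pairwise shortest-path distances and the pairwise maximum available bandwidths (the bottleneck-link bandwidth of each path) have already been computed and are passed in as square matrices; the names `p_min`, `ab_max`, `judgment_metrics`, and `initial_cluster_center`, as well as the sample matrices, are hypothetical.

```python
def judgment_metrics(p_min, ab_max):
    """Return (metrics, weights) for every EDC.

    p_min[i][j]:  shortest-path distance between EDC i and EDC j.
    ab_max[i][j]: maximum available bandwidth between EDC i and EDC j.
    Requires at least two EDCs.
    """
    n = len(p_min)
    # Average shortest-path distance and average bandwidth to the other EDCs.
    avg_p = [sum(p_min[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    avg_ab = [sum(ab_max[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    total_p = sum(avg_p)
    # Weight: a smaller share of the total distance yields a larger weight.
    weights = [(total_p - p) / (total_p * (n - 1)) for p in avg_p]
    metrics = [w * ab for w, ab in zip(weights, avg_ab)]
    return metrics, weights


def initial_cluster_center(p_min, ab_max):
    """Index of the EDC with the largest judgment metric."""
    metrics, _ = judgment_metrics(p_min, ab_max)
    return max(range(len(metrics)), key=metrics.__getitem__)


# Hypothetical three-EDC example.
p_min = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]
ab_max = [[0, 10, 20], [10, 0, 30], [20, 30, 0]]
metrics, weights = judgment_metrics(p_min, ab_max)
center = initial_cluster_center(p_min, ab_max)
```

In this example the most central EDC (index 1) receives the largest weight, but the EDC at index 2 has enough extra bandwidth to obtain the largest metric, which shows how the two factors trade off. The weights also sum to 1, as Constraint 2 requires.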

Step 2: Select the remaining cluster centers

Problem description: After obtaining the initial cluster center using Formula (18), the remaining cluster centers need to be determined. As described earlier, if the locations of the cluster centers are too concentrated, this can increase data communication time and negatively affect subsequent cluster division, resulting in poor data aggregation. Therefore, a selection scheme for the remaining cluster centers is designed by comparing the distances between the cluster centers.

Problem definition: We define the cluster center $CC$ . Assuming the number of clusters to be divided is $c (c \geq 1)$ then $C C = \{{CC}_{1}, {CC}_{2}, {CC}_{3}, \dots, {CC}_{c}\}$ . We consider the shortest path between cluster centers as the judgment criterion; that is, the farther a point is from the currently determined cluster center, the higher the probability it has of being selected as the next cluster center to complete the selection of the remaining centers. The specific solution is as follows:

When $c = 1$ , it means that only one cluster center is selected at the current stage. At this time, the EDC with the largest $m_{i}^{s}$ can be selected as the cluster center;
When $c = 2$, we first select the EDC with the largest $m_{i}^{s}$ as the first cluster center ${CC}_{1}$, and then select the EDC farthest from ${CC}_{1}$ as the second cluster center ${CC}_{2}$;
When $c > 2$, assuming that $n^{'}$ cluster centers have been selected $(2 \leq n^{'} < c)$, the steps for selecting the $(n^{'} + 1)$th cluster center are as follows:
Calculate the center $cen$ of the first $n^{'}$ cluster centers:

cen = (\frac{1}{n^{'}} \sum_{i = 1}^{n^{'}} x_{i}, \frac{1}{n^{'}} \sum_{i = 1}^{n^{'}} y_{i}) \tag{20}

where $(x_{i}, y_{i})$ is the coordinate of the $i$th selected cluster center.

Select the $(n^{'} + 1)$th cluster center: After obtaining $cen$, calculate the distance between each remaining EDC and $cen$, and select the EDC farthest from $cen$ as the $(n^{'} + 1)$th cluster center.
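The cases above can be sketched as a single loop in Python, since the centroid of one selected center is that center itself. The sketch assumes planar coordinates and Euclidean distance to the centroid, which the text implies but does not state; the function name `select_cluster_centers` and the sample coordinates are hypothetical, and the first center is taken as given (for example, the EDC with the largest judgment metric).

```python
import math


def select_cluster_centers(coords, first_center, c):
    """Pick c cluster centers, starting from the given initial center.

    coords:       list of (x, y) positions, one per EDC.
    first_center: index of the initial cluster center.
    c:            number of clusters, 1 <= c <= len(coords).
    """
    centers = [first_center]
    while len(centers) < c:
        # Centroid of the centers selected so far.
        cen = (
            sum(coords[i][0] for i in centers) / len(centers),
            sum(coords[i][1] for i in centers) / len(centers),
        )
        remaining = [i for i in range(len(coords)) if i not in centers]
        # The EDC farthest from the centroid becomes the next center.
        centers.append(max(remaining, key=lambda i: math.dist(coords[i], cen)))
    return centers


# Hypothetical layout of five EDCs; EDC 4 is the initial cluster center.
coords = [(0, 0), (10, 0), (0, 6), (9, 9), (4, 4)]
centers = select_cluster_centers(coords, first_center=4, c=3)
```

Ties are broken by the lowest EDC index here, which is an arbitrary choice of this sketch.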

Step 3: Divide clusters

Problem description: The number of cluster members affects both the stage completion time ST and the number of stages. When there are too many cluster members, competition for network resources around the cluster center becomes intense, leading to increased transmission times. Conversely, if there are too few cluster members, the number of stages required to complete the job will increase. To address these issues, a cluster division method is proposed.

Problem definition: Given the number of clusters $c$, the number of EDCs $nn$ allocated to each cluster lies in the interval:

1 \leq nn \leq \lceil \frac{n - c}{c} \rceil, \quad nn \in I

When $c = 1$ , there is only one cluster at the current stage, and there is no need to divide the cluster. The remaining EDCs are all cluster members of the initial cluster center;
When $c \geq 2$ , first sort the shortest-path distances $P_{\min}$ between the initial cluster center and the remaining EDCs in ascending order, and select the first $nn$ EDCs as cluster members of the initial cluster center;
Then, for the current cluster center ${CC}_{n} (2 \leq n \leq c)$ , sort the $P_{\min}$ between ${CC}_{n}$ and the EDCs not yet assigned to a cluster in ascending order, and select the first $nn$ EDCs as cluster members of the current cluster center;
Repeat the previous step until the division for cluster center ${CC}_{c - 1}$ is completed. Finally, the EDCs that remain unassigned become cluster members of cluster center ${CC}_{c}$ .
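
As a quick check of the bound on $nn$, a stage with $n = 25$ EDCs and $c = 5$ clusters allows $1 \leq nn \leq \lceil 20 / 5 \rceil = 4$. The division procedure itself can be sketched as follows. The function name `divide_clusters` and the representation of $P_{\min}$ as a precomputed matrix `p_min[i][j]` are assumptions for illustration; how the shortest paths are obtained is outside the scope of this sketch.

```python
import math


def divide_clusters(centers, p_min, n_nodes, nn):
    """Assign non-center EDCs to cluster centers.

    centers: ordered list of center indices [CC_1, ..., CC_c].
    p_min: p_min[i][j] is the shortest-path distance between EDC i and EDC j
           (assumed to be precomputed).
    n_nodes: total number of EDCs n in the current stage.
    nn: number of members given to each of the first c - 1 centers.
    """
    c = len(centers)
    upper = math.ceil((n_nodes - c) / c)
    if not 1 <= nn <= upper:
        raise ValueError(f"nn must lie in [1, {upper}]")

    undivided = [i for i in range(n_nodes) if i not in centers]
    clusters = {}
    for cc in centers[:-1]:
        # Sort the still-undivided EDCs by shortest-path distance to this
        # center (ascending) and take the nearest nn as its members.
        undivided.sort(key=lambda i: p_min[cc][i])
        clusters[cc] = undivided[:nn]
        undivided = undivided[nn:]
    # The last center CC_c receives every EDC that is still undivided.
    # With c = 1 the loop is skipped and all EDCs join the single cluster.
    clusters[centers[-1]] = undivided
    return clusters
```

Note that earlier centers choose first, so the membership of later clusters depends on the order in which the centers were selected.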

In summary, the optimal cluster center selection algorithm proposed in this paper is shown in Fig. 7. In addition, Table 2 summarizes the major notations.

[See PDF for image]

Fig. 7

Flowchart of the optimal cluster center selection algorithm

Table 2. Major notations

Notation	Definition
${EDC}_{i}$	Edge data center $i$
${CC}_{i}$	Cluster center $i$
CT	Current cluster's completion time
ST	Current stage's completion time
TT	Total completion time of current task
s	The number of the stage
$c$	The number of clusters to be divided
$nn$	The number of EDCs allocated to each cluster
${ct}_{c, (E D C, C C)}^{s}$	In cluster c, the communication time between the $EDC$ and the CC
${pt}_{c}^{s}$	The data processing time of the cluster c in the stage s
${st}_{E D C, {EDC}^{'}}^{s}$	The send time between the $EDC$ and the ${EDC}^{'}$ in the stage s
${tt}_{E D C, {EDC}^{'}}^{s}$	The transmission time between the $EDC$ and the ${EDC}^{'}$ in the stage s
$v$	The transmission rate of the data packet in the link
AB	Available bandwidth
DV	The node's data volume
$d_{E D C, {EDC}^{'}}$	The link distance between the $EDC$ and the ${EDC}^{'}$
$Δ t^{S}$	The remaining time for the bottleneck cluster to complete its work in the stage s
$P_{\min}$	The shortest path between nodes
$m_{i}^{s}$	The judgment metric
$W {\bar{P}}_{m i n, i}$	The weight value of the average shortest path distance between ${EDC}_{i}$ and other $EDCs$
${\bar{AB}}_{m a x, i}$	The maximum average available bandwidth between ${EDC}_{i}$ and other EDCs

Experimental results and analysis

In this section, we aim to demonstrate the performance of the edge data aggregation and reporting algorithm HDAR-CCS proposed in this paper. First, we compare our improved method with the one presented in reference [19] to establish its effectiveness. Additionally, to further highlight the superiority of our approach, we selected two recent distributed data aggregation algorithms—LDA-EPP from [17] and DEDA from [18]—as comparison methods. We evaluated these algorithms based on three indicators: ST, TT and bandwidth consumption. The following details outline our experimental simulation and result analysis.

Experimental environment

The methods proposed in this paper are implemented using Matlab 2020. The experimental data were obtained by simulating a tactical edge network comprising several nodes within a 20 km × 20 km area. In this network, each node is assumed to be equipped with an edge center, configured with the parameters necessary for the algorithm. Bidirectional links are established between nodes, and each node is allocated a resource range of [10, 100] Gbit. All algorithms were executed on a computer featuring a Core i7-9700 CPU, 3.00 GHz, and 32.0 GB of RAM.
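
For readers who wish to set up a comparable simulation, a minimal topology generator might look like the sketch below. It is not the authors' Matlab code. Uniform random node placement, uniformly drawn resources, the fixed seed, straight-line link lengths, and all names (`make_edge_network`, `link_km`, `resource_gbit`) are assumptions made for illustration; the paper specifies only the 20 km × 20 km area and the [10, 100] Gbit resource range.

```python
import math
import random


def make_edge_network(num_nodes, side_km=20.0, resource_range=(10.0, 100.0), seed=0):
    """Generate edge nodes inside a square area.

    Assumptions: positions and resources are drawn uniformly at random,
    and the seed is fixed only to make the example repeatable.
    """
    rng = random.Random(seed)
    nodes = []
    for i in range(num_nodes):
        nodes.append({
            "id": i,
            "x": rng.uniform(0.0, side_km),
            "y": rng.uniform(0.0, side_km),
            "resource_gbit": rng.uniform(*resource_range),
        })
    return nodes


def link_km(a, b):
    """Straight-line length of the bidirectional link between two nodes."""
    return math.dist((a["x"], a["y"]), (b["x"], b["y"]))
```

Calling `make_edge_network(25)`, `make_edge_network(38)` and `make_edge_network(50)` would give three networks of the sizes used in the experiments below.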

Performance comparison between MGDD-CC and HDAR-CCS

ST analysis

To compare the performance of the proposed HDAR-CCS with that of MGDD-CC, and to reduce the variability of the experimental results, we applied both algorithms to tactical edge networks with 25, 38, and 50 nodes. In each edge network, we first set different numbers of cluster members (nn + 1) to determine the number of stages (s) required for the two algorithms to complete data aggregation. We then used the cluster center selection algorithm proposed in "Cluster center optimal selection algorithm" to identify the cluster centers and cluster member divisions for each stage of both methods. Subsequently, we ran the two algorithms to complete data aggregation and recorded the time (ST) required to finish the work at each stage. The experimental results are presented in Figs. 8, 9 and 10.

[See PDF for image]

Fig. 8

ST of two algorithms in 25-node edge network

[See PDF for image]

Fig. 9

ST of two algorithms in 38-node edge network

[See PDF for image]

Fig. 10

ST of two algorithms in 50-node edge network

As shown in Figs. 8, 9 and 10, the ST of HDAR-CCS in each of the first s-1 stages is less than or equal to the ST of the corresponding stage for MGDD-CC. At stage s, the STs of the two methods are equal. This equality arises because we determined the number of clusters and the division of cluster members in each stage using the cluster center selection algorithm proposed in "Cluster center optimal selection algorithm", so the cluster centers and member divisions for both methods are the same at each stage. Additionally, there is only one cluster in the last stage, and according to Formula (15), the ST of HDAR-CCS in this stage is $\max({ST}^{S})$ , which is the same as for MGDD-CC. Consequently, at stage s, excluding $Δ t^{S - 1}$ , the STs of the two methods are equal.

We also observe that as the number of cluster members increases, the number of stages required to complete a job decreases, while the overall trend in ST increases. This occurs because, on one hand, competition for network resources around the cluster center intensifies, leading to increased transmission time; on the other hand, the amount of data transmitted to a single cluster center rises, resulting in higher processing time and thus greater ST.

Finally, a closer look shows that for the 25-node and 38-node networks, the ST of each stage of the two methods tends to peak when the number of cluster members (nn + 1) is 4, while for the 50-node network it tends to peak when nn + 1 = 5. Although ST generally increases with the number of cluster members, the structure of the edge network itself becomes more complex as more edge nodes are deployed, so the peak does not always occur at the largest cluster size. Therefore, specific experimental analysis is needed to select an optimal cluster member division scheme and to build the edge data aggregation and reporting model best suited to the current edge network.

Based on the analysis above, the results indicate that the performance of the HDAR-CCS proposed in this paper is better than that of MGDD-CC. Additionally, the experiments demonstrate the validity of the optimization model presented in this paper.

Analysis of the optimal cluster member division scheme

According to the description in the previous section, we need to select an optimal scheme for dividing cluster members. To do this, we compare the work completion time (TT) of the two methods. Figure 11 presents the comparison results of TT for the two methods, varying the number of cluster members across different node networks. The specific experimental results are detailed in Table 3. Due to space limitations, in the method column of Fig. 11, we refer to method 1 as MGDD-CC and method 2 as HDAR-CCS.

Table 3. Experimental results of two methods

	TT
nn + 1	3		4		5
Method	MGDD-CC	HDAR-CCS	MGDD-CC	HDAR-CCS	MGDD-CC	HDAR-CCS
25 nodes network	10.2025	10.0329	11.9289	11.8702	7.2798	7.27
38 nodes network	14.5989	13.3157	17.3346	14.3362	14.531	14.0774
50 nodes network	15.7589	13.53	13.2446	12.3851	15.8576	14.7437

Bold numbers indicate the time of the minimum TT value for the current node network using the two methods

[See PDF for image]

Fig. 11

Comparison results of TT between the two methods

As shown in Table 3, for both 25 and 38 node networks, the TT values of both algorithms are smallest when the number of cluster members is 5. Therefore, we select 5 cluster members and 2 stages as the optimal cluster member division scheme for these edge networks. Similarly, for the 50 node network, the TT is minimized when nn + 1 equals 4. In this case, we choose 4 cluster members and 3 stages as the optimal cluster member division scheme. Additionally, as illustrated in Fig. 11, as the number of network nodes increases and the network becomes more complex, HDAR-CCS takes noticeably less time than MGDD-CC to complete tasks. Consequently, in addressing the communication and computing requirements for processing massive data, the method proposed in this paper is well-suited for the current edge network environment. It can efficiently process and aggregate data at the edge and facilitate effective edge-to-cloud data aggregation and reporting.
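
To make the gap between the two methods easier to read off, the relative reduction in TT can be computed directly from the values in Table 3. The short sketch below does this; the dictionary layout and the helper name `relative_reduction` are introduced here only for illustration, and the numbers are copied from the table as extracted.

```python
# TT values copied from Table 3: {nodes: {nn + 1: (MGDD-CC, HDAR-CCS)}}
TT = {
    25: {3: (10.2025, 10.0329), 4: (11.9289, 11.8702), 5: (7.2798, 7.27)},
    38: {3: (14.5989, 13.3157), 4: (17.3346, 14.3362), 5: (14.531, 14.0774)},
    50: {3: (15.7589, 13.53), 4: (13.2446, 12.3851), 5: (15.8576, 14.7437)},
}


def relative_reduction(mgdd, hdar):
    """Percentage by which HDAR-CCS shortens TT relative to MGDD-CC."""
    return 100.0 * (mgdd - hdar) / mgdd


for nodes, row in TT.items():
    for members, (mgdd, hdar) in row.items():
        saving = relative_reduction(mgdd, hdar)
        print(f"{nodes} nodes, nn + 1 = {members}: {saving:.2f}% lower TT")
```

Comparing the printed percentages across the three network sizes gives a quick numerical view of the trend plotted in the TT comparison figure.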

Performance comparison between other algorithms and HDAR-CCS

To further demonstrate the superiority of the proposed method, we selected two recent distributed data aggregation algorithms, LDA-EPP [17] and DEDA [18], as comparison methods. We evaluated the performance of the proposed method against these algorithms using work completion time (TT) and bandwidth consumption as indicators. Figures 12 and 13 present the comparison results for TT and bandwidth consumption across the algorithms in different node networks. Notably, the proposed algorithm (HDAR-CCS) uses its best cluster member division scheme in each of the three node networks for comparison with the other two algorithms.

[See PDF for image]

Fig. 12

Comparison results of TT for different algorithms

[See PDF for image]

Fig. 13

Comparison of bandwidth consumption of different algorithms

As shown in Figs. 12 and 13, compared with the other two methods, the HDAR-CCS method proposed in this paper has the smallest TT and the lowest bandwidth consumption in edge networks of every size tested. This is because HDAR-CCS takes into account factors such as data aggregation time and the available bandwidth of node links in the selection of cluster centers, the division of cluster members, and the construction of the data aggregation model, and thereby forms a systematic edge data aggregation and reporting method. Therefore, both the experimental results and the theoretical basis indicate that the hierarchical distributed edge data aggregation and reporting method based on cluster center selection proposed in this paper performs well.

Conclusion

In this paper, we propose a hierarchical distributed edge data aggregation and reporting method based on cluster center selection to address two key challenges: aggregating data stored in edge data centers into cluster centers, and selecting appropriate edge data centers as aggregation centers. The proposed scheme pursues two primary goals. First, it enhances the data aggregation and reporting process. By combining computing and network resources, we minimize data aggregation time while ensuring low bandwidth consumption. We employ a distributed data aggregation strategy, dividing the aggregation process into multiple stages where data distributed across various locations is processed in parallel at each stage. Second, we optimize the selection of cluster centers and the division of clusters. We establish a judgment metric as the criterion for selecting the initial cluster center and design a selection scheme for the remaining cluster centers based on distance comparisons between the centers. Additionally, we implement a method for dividing cluster members for each cluster center. Experimental results demonstrate that the method proposed in this paper outperforms existing algorithms and effectively meets the requirements of intelligent applications for low latency, high bandwidth, and efficient processing.

In future work, we will continue to optimize the edge data aggregation and reporting method proposed in this article. This includes filtering the data collected on the device side, removing redundant information, storing the data in the edge network center, and ultimately completing the edge aggregation and reporting process. Through continued optimization, we aim to make the method more intelligent.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61931004) and the Jiangsu Innovation & Entrepreneurship Group Talents Plan.

Author contributions

Wensheng Yang contributed to the conception of the study, performed the experiment, performed the data analyses and wrote the manuscript; Chengsheng Pan contributed significantly to the preparation and revision of the manuscript, and provided the funding support.

Data availability

The data supporting this study’s findings are available from the corresponding author upon reasonable request.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Kaur, M; Munjal, A. Data aggregation algorithms for wireless sensor network: a review. Ad Hoc Netw; 2020; 100, 102083. [DOI: https://dx.doi.org/10.1016/j.adhoc.2020.102083]

2. Kumar, A; Sato, Y; Oishi, T; Ono, S; Ikeuchi, K. Improving GPS position accuracy by identification of reflected GPS signals using range data for modeling of urban structures. Seisan Kenkyu; 2014; 66, pp. 101-107. [DOI: https://dx.doi.org/10.11188/seisankenkyu.66.101]

3. Kumar, A; Banno, A; Ono, S; Oishi, T; Ikeuchi, K. Global coordinate adjustment of the 3D survey models under unstable GPS condition. Seisan Kenkyu; 2013; 65, 2 pp. 91-95. [DOI: https://dx.doi.org/10.11188/seisankenkyu.65.91]

4. Faizan Ullah, M; Imtiaz, J; Maqbool, K. Enhanced three layer hybrid clustering mechanism for energy efficient routing in iot. Sensors; 2019; [DOI: https://dx.doi.org/10.3390/s19040829]

5. Gupta, GP; Misra, M; Garg, K. Towards scalable and load-balanced mobile agents-based data aggregation for wireless sensor networks. Comput Electr Eng; 2017; [DOI: https://dx.doi.org/10.1016/j.compeleceng.2017.10.020]

6. Zhu, T; Li, J; Gao, H; Li, Y. Data aggregation scheduling in battery-free wireless sensor networks. IEEE Trans Mob Comput; 2020; PP, 99 pp. 1-1. [DOI: https://dx.doi.org/10.1109/TMC.2020.3035671]

7. Zhang, ZM; Yang, W; Wu, FY; Li, P. Privacy and integrity-preserving data aggregation scheme for wireless sensor networks digital twins. J Cloud Comput-Adv Syst Appl; 2023; 12, 140. [DOI: https://dx.doi.org/10.1186/s13677-023-00522-7]

8. Sasirekha, S; Swamynathan, S. Cluster-chain mobile agent routing algorithm for efficient data aggregation in wireless sensor network. J Commun Netw; 2017; 19, 4 pp. 392-401. [DOI: https://dx.doi.org/10.1109/JCN.2017.000063]

9. Dhanaraj, RK; Lalitha, K; Anitha, S; Chandra, SK; Gupta, P; Goyal, MK. Hybrid and dynamic clustering based data aggregation and routing for wireless sensor networks. J Intell Fuzzy Syst Appl Eng Technol; 2021; [DOI: https://dx.doi.org/10.3233/JIFS-201756]

10. Sankaralingam, SK; Nagarajan, NS; Narmadha, AS. Energy aware decision stump linear programming boosting node classification based data aggregation in WSN. Comput Commun; 2020; 155, Apr. pp. 133-142. [DOI: https://dx.doi.org/10.1016/j.comcom.2020.02.062]

11. Wu, HQ; Wang, L; Xue, G. Privacy-aware task allocation and data aggregation in fog-assisted spatial crowdsourcing. IEEE Trans Netw Sci Eng; 2019; PP, 99 pp. 1-1. [DOI: https://dx.doi.org/10.1109/TNSE.2019.2892583]

12. Bou, S; Kitagawa, H; Amagasa, T. Cpix: real-time analytics over out-of-order data streams by incremental sliding-window aggregation. IEEE Trans Knowl Data Eng; 2021; PP, 99 pp. 1-1. [DOI: https://dx.doi.org/10.1109/TKDE.2021.3054898]

13. Naghibi, M; Barati, H. SHSDA: secure hybrid structure data aggregation method in wireless sensor networks. J Ambient Intell Humaniz Comput; 2021; [DOI: https://dx.doi.org/10.1007/s12652-020-02751-z]

14. Long, NB; Tran-Dang, H; Kim, DS. Energy-aware real-time routing for large-scale industrial internet of things. IEEE Internet Things J; 2018; 5, 3 pp. 2190-2199. [DOI: https://dx.doi.org/10.1109/JIOT.2018.2827050]

15. Cheng, L; Wang, Y; Liu, Q; Epema, DHJ; Murphy, J. Network-aware locality scheduling for distributed data operators in data centers. IEEE Trans Parallel Distrib Syst; 2021; PP, 99 pp. 1-1. [DOI: https://dx.doi.org/10.1109/TPDS.2021.3053241]

16. Ullah, I; Youn, HY. Efficient data aggregation with node clustering and extreme learning machine for wsn. J Supercomput; 2020; [DOI: https://dx.doi.org/10.1007/s11227-020-03236-8]

17. Li, Y; Chen, S; Zhao, C; Lu, W. Layered data aggregation with efficient privacy preservation for fog-assisted IIoT. Int J Commun Syst; 2020; [DOI: https://dx.doi.org/10.1002/dac.4381]

18. Joshi, P; Raghuvanshi, AS; Kumar, S. An intelligent delay efficient data aggregation scheduling for distributed sensor networks. Microprocess Microsyst; 2022; 93, 104608. [DOI: https://dx.doi.org/10.1016/j.micpro.2022.104608]

19. Liu, Z; Yuan, XM; Yuan, J; Zhang, JW; Gu, ZQ; Zhang, L. Multi-stage geo-distributed data aggregation with coordinated computation and communication in edge compute first networking. J Lightw Technol; 2023; 41, 8 pp. 2289-2300. [DOI: https://dx.doi.org/10.1109/JLT.2022.3232840]

20. Singh VK, Verma S, Kumar M (2016) Privacy preserving in-network aggregation in wireless sensor networks. In: The 11th international conference on future networks and communications, pp 216–223. https://doi.org/10.1016/j.procs.2016.08.034

21. Jia, L; Hua, QS; Fan, H; Wang, Q; Hai, J. Efficient distributed algorithms for holistic aggregation functions on random regular graphs. Sci China Inf Sci; 2022; 65, 152101. [DOI: https://dx.doi.org/10.1007/s11432-020-2996-2]

22. Zhang, J; Zhu, J; Jia, Z; Yan, X. A secret confusion based energy-saving and privacy-preserving data aggregation algorithm. Chin J Electron; 2018; 26, 4 pp. 740-746. [DOI: https://dx.doi.org/10.1049/cje.2016.08.031]

23. Wu, Q; Zhou, F; Xu, J; Feng, D. Lightweight and verifiable secure aggregation for multi-dimensional data in edge-enhanced Iot. Comput Netw; 2023; 237, Dec. pp. 1.1-1.13.


© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
