Content area
In large-scale data processing, graph analytics of complex interaction networks are indispensable. As the whole graph processing and analytics can be inefficient and usually impractical, graph sampling by keeping a portion of the original graph becomes a favorable approach. While prior work focused on fixed edge and node selection strategy based on predetermined criteria, without adaptive feedback to adjust the sampling process, this type of sampling algorithms has limited flexibility and estimation accuracy for complex graphs. In this paper, we propose an adaptive graph sampling framework, and design AdapES, an adaptive edge sampling algorithm based on this framework. Compared to non-adaptive sampling methods, our approach can continually monitor the difference between the current sampled subgraph and the original graph, and dynamically adjust the edge sampling probability based on this observed sampling difference. Guided by a preset sampling goal, this algorithm automatically adapts to the fluctuations in the random sampling process with high flexibility. The experimental evaluation in 11 datasets demonstrates that AdapES outperforms other algorithms for preserving various graph properties and statistics.
Introduction
With the development of big data and cloud computing, analyzing large-scale data with complex interactions has become a key aspect of data processing and analytics (Gonzalez et al. 2014; Sahu et al. 2020; Fan 2022; Fan et al. 2021; Nguyen et al. 2013; Sakr et al. 2021; Chen et al. 2021a; Pandey et al. 2021; Ahmed et al. 2014, 2013; Yang et al. 2022; Hoang et al. 2021). As graphs are often used for modeling complex interaction networks, graph analytics are indispensable. Graphs from web graphs to social networks are processed to extract valuable information, such as web page rankings and online user behavior patterns, by analyzing graph properties and features (Nguyen et al. 2013; Ahmed et al. 2014, 2013; Yang et al. 2022; Hoang et al. 2021; Tan et al. 2021; Swift et al. 2022; Nakajima and Shudo 2022). However, as the volume and complexity of graphs grow, processing the whole graph is inefficient and often impractical. Finding a representative subgraph which can effectively preserve the properties of the original graph becomes a more favorable strategy for solving graph processing problems. Graph sampling has been used in graph pattern mining (Zhao et al. 2020; Zhu et al. 2023; Preti et al. 2023), streaming graph processing (Ahmed et al. 2014, 2013; Shin et al. 2020; Leskovec and Faloutsos 2006; Mariappan et al. 2021), graph visualization (Gove 2019), graph representation learning (Zheng et al. 2020; Zeng et al. 2020; Abu-El-Haija et al. 2023; Wan et al. 2022; Zhang et al. 2022; Cong et al. 2020), and other fields (Nakajima and Shudo 2022; Choe et al. 2022; Gao et al. 2021; Bera and Seshadhri 2020; Imola et al. 2022). Prior work attempted to sample graphs through fixed edge and node selection strategies without adaptive feedback. However, given a large graph of complex interconnections, especially a streaming graph, those sampling algorithms have limited flexibility and effectiveness in sampling a subgraph that preserves the key properties.
To address this challenge, we propose an adaptive graph sampling framework consisting of three components: goal, selection and adaptation. Based on this framework, we design and implement AdapES, an adaptive edge sampling algorithm. This algorithm sets a goal of matching a global graph property, which is the average node degree in this paper, and dynamically adjusts the sampling process guided by the goal, to generate a subgraph approximating the original graph. In this approach, the probability of edge selection is proportional to the edge weight, which is defined using the incident node degrees and a parameter α. Based on the observed difference between the sampled graph and the original graph, this approach dynamically changes α to adjust the edge sampling probability and minimize the sampling error. For sampling a real graph represented as a collection of edges, this adaptive edge sampling method applies a sampling ratio based on the edge count, generating a sample graph whose size is proportional to the sampling ratio, in contrast to methods based on the node count. This adaptive sampling algorithm is flexible and effective in finding a subgraph that preserves graph properties. The experimental evaluation on 11 real-world datasets demonstrates that the proposed approach outperforms other graph sampling methods over a range of sampling ratios.
The rest of this paper is organized as follows. Related work is summarized and discussed in Sect. 2. Section 3 describes the design and implementation of our proposed approach. Experimental evaluation results are presented and analyzed in Sect. 4. Finally, Sect. 5 concludes this paper.
Related work
Prior work of graph sampling focused on algorithms using various strategies to find a representative subgraph (Nguyen et al. 2013; Ahmed et al. 2014, 2013; Nakajima and Shudo 2022; Swift et al. 2022; Leskovec and Faloutsos 2006; Choe et al. 2022; Gao et al. 2021; Zhao et al. 2020; Zhu et al. 2023; Gove 2019; Li et al. 2019; Yang et al. 2019; Liu et al. 2019). From the category of methodologies, there are random node sampling, random edge sampling, exploration-based sampling and others. Based on those methods, a significant amount of previous works have proposed optimized sampling algorithms and implementations for various purposes including graph processing (Ahmed et al. 2014, 2013; Choe et al. 2022; Li et al. 2019; Yang et al. 2019; Rozemberczki et al. 2020), graph representation learning (Zeng et al. 2020; Abu-El-Haija et al. 2023) and others (Swift et al. 2022; Zhao et al. 2020; Zhu et al. 2023; Tětek and Thorup 2022; Ben-Eliezer et al. 2022; Alev and Lau 2020). These graphs can be static graphs or streaming graphs.
PIES (Ahmed et al. 2014, 2013) is an edge sampling algorithm for streaming graph sampling. This approach maintains a dynamic sample while new edges of the graph stream in. It initially creates a reservoir of nodes from the stream, and updates the reservoir by randomly replacing existing nodes with new nodes from the stream. PIES iterates over the stream in one pass, deterministically adding the initial edges incident to n nodes (determined by the sample size). After it reaches the sample size n, it probabilistically adds the nodes incident to each streaming edge and replaces nodes in the reservoir with these incident nodes. Given the nature of edge sampling, PIES is biased toward selecting high-degree nodes due to the initial step of edge sampling. The evaluation results show that PIES has more difficulty capturing the properties of dense graphs than those of sparse graphs.
MiDaS, Minimum Degree Biased Sampling of Hyperedges (Choe et al. 2022), is a sampling algorithm for representative sampling of hypergraphs. It aims to improve Random Hyperedge Sampling (RHS), which suffers from weak connectivity, by sampling toward high-degree nodes. To better preserve the node degree, it utilizes a weight function for each hyperedge to control the bias of the subhypergraph sample in the degree distribution. As the hyperedge selection probability is proportional to a function of a hyperparameter α, the preservation of the degree distribution in the sampling result is controlled by this parameter. In this algorithm, α needs to be provided as a constant value. To find an appropriate α, MiDaS tunes it using a linear regressor, and applies hill-climbing search for a better value in the discrete search space. MiDaS needs this tuning and search process to find α, which can be expensive. Moreover, this method applies only to hypergraphs, which differs from general graph sampling scenarios.
Little Ball of Fur (Rozemberczki et al. 2020) is an open source graph sampling framework implemented in Python. In this Python library, it includes the implementation of various existing graph sampling algorithms in types of node sampling, edge sampling and exploration-based sampling. The framework is built on widely used open source Python packages NetworkX (Hagberg et al. 2008) and NetworKit (Staudt et al. 2016) for network analysis, manipulation and study. This library provides an implementation reference of graph sampling and analysis.
For graph representation learning, the process of training Graph Neural Networks (GNNs) often involves sampling a smaller subgraph to enhance the training efficiency. GraphSAINT (Zeng et al. 2020) has integrated several efficient samplers (random node, random edge, random walk) to sample the training graph to alleviate “neighbor explosion” during large graph training. This approach extracts connected subgraphs which are combined together, so graph training can learn good representation overall. In the edge sampler, the optimal edge probability is inversely proportional to the connected node degree in order to select more edges connecting nodes with few neighbors as these nodes are likely to have high influence on each other.
SubMix (Abu-El-Haija et al. 2023) is a tree-based sampling method for robust GNN training. It performs level-wise sampling by selecting neighbors of the initial seed node, and subsequent neighbors of previously sampled nodes. This approach mixes different sampling heuristics, with trainable component weights assigning neighborhood distribution functions. As a parameterized sampling method, it contains various samplers for sampling neighbors, such as a PageRank sampler that selects neighbors with probability proportional to the node's PageRank value. While these sampling methods are proposed to improve GNN training, this goal differs from finding sampled subgraphs that preserve desired graph properties.
Along with proposing optimized graph sampling algorithms and implementations, there are research efforts devoted to applying graph sampling to graph analysis and graph mining. Arya (Zhu et al. 2023) is a graph pattern mining system for graph pattern approximation through graph pattern decomposition and edge sampling. TETRIS (Bera and Seshadhri 2020) is an algorithm for triangle count estimation through random incident sampling. It utilizes a random walk to obtain an edge set, from which it samples edges with a probability proportional to the edge degree (deg(u) + deg(v) for edge e = (u, v)), then samples random neighbors of these edges, and estimates the triangle count.
Compared to prior studies of graph sampling, this paper proposes a novel approach of adaptive graph sampling to improve the representativeness of the sampled subgraph by dynamically adjusting the sampling process with high flexibility.
Adaptive graph sampling
To find a representative subgraph of a large graph, random sampling is applied toward a statistically unbiased estimator. One way is to design a random sampling approach and select edges and nodes by following a sampling process based on a devised strategy, which could be random node selection, random edge selection, exploration-based selection, or others. However, given the complexity of large graphs, achieving promising sampling results through this approach is challenging.
To address this challenge, we propose an adaptive graph sampling framework to generate a more representative subgraph for a complex graph. The sampling process is dynamically adjusted based on the difference between the current sample graph and the original graph, in order to generate a subgraph matching the graph properties of the original graph. This adaptive sampling framework consists of three components:
Goal Find a global graph property correlated with the graph properties for graph representativeness. Set a goal which is matching this global graph property during the sampling process.
Selection Design a basic edge and node selection strategy for the graph sampling, which can preserve the graph topological properties.
Adaptation Design an adaptive sampling approach based on the basic selection strategy to dynamically adjust the sampling process toward matching the global graph property, which serves as the sampling goal.
Graph properties
We introduce and analyze graph properties and statistics used to measure the representativeness of sampled subgraphs. These graph statistics include the degree distribution, clustering coefficient distribution and k-core distribution. They are important measures for capturing the connectivity, transitivity, density and community structure of the graph, and are widely used in graph and network analysis (Ahmed et al. 2013; Chen et al. 2021b; Trolliet et al. 2022; Van Koevering et al. 2021; Hong and Lu 2020; You et al. 2020). From their definitions, these statistics are correlated with the average node degree. As an important graph property, the average node degree is selected as the global graph property, which is defined in Eq. (1) for the original graph G. Matching the average node degree is set as the goal for this adaptive sampling approach.
avgdeg(G) = 2|E| / N    (1)
where |E| is the edge count and N the node count of G.
Degree distribution. The degree distribution is an important measure of graph connectivity (Chen et al. 2021b). The ratio of nodes with degree d is defined in Eq. (2), where N is the node count and N_d is the number of nodes with degree d.

P(d) = N_d / N    (2)
The average node degree defined in Eq. (1) is the expected value E[deg] of the node degrees. As shown in Eq. (3), where D is the set of degrees, the degree distribution is closely correlated with the average node degree.

E[deg] = Σ_{d ∈ D} d · P(d)    (3)
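As a concrete illustration, the degree statistics above can be computed directly from an edge list. The following is a minimal sketch on a toy graph (the edge list is hypothetical):

```python
from collections import Counter

# Toy undirected edge list (hypothetical example graph).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Node degrees accumulated from the edge list.
deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

N = len(deg)                    # node count
avg_deg = 2 * len(edges) / N    # Eq. (1): avgdeg = 2|E| / N

# Eq. (2): P(d) = N_d / N, the ratio of nodes with degree d.
degree_dist = {d: c / N for d, c in Counter(deg.values()).items()}

# Eq. (3): the expectation over the degree distribution recovers avgdeg.
expected = sum(d * p for d, p in degree_dist.items())
assert abs(expected - avg_deg) < 1e-9
```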
Clustering coefficient distribution. The ratio of nodes with clustering coefficient (cc) c is defined in Eq. (4), where N_c is the number of nodes with clustering coefficient c.

P(c) = N_c / N    (4)
The clustering coefficient of vertex v is the ratio of the number of observed triangles centered at vertex v over the number of all possible triangles, which is defined in Eq. (5), where N(v) denotes the set of neighbors of vertex v and T(v) the number of observed triangles centered at v. The clustering coefficient is an important measure of graph transitivity and clustering (Trolliet et al. 2022). It is determined based on the node degree, and is correlated with the average node degree.

cc(v) = 2 T(v) / (|N(v)| (|N(v)| - 1))    (5)
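The per-node clustering coefficient of Eq. (5) can be sketched in a few lines by counting closed neighbor pairs (the toy graph below is hypothetical):

```python
from itertools import combinations

# Toy graph: a triangle (0, 1, 2) plus a pendant node 3 (hypothetical).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def clustering(v):
    """Eq. (5): observed triangles at v over all possible neighbor pairs."""
    nbrs = adj[v]
    kk = len(nbrs)
    if kk < 2:
        return 0.0
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return triangles / (kk * (kk - 1) / 2)

assert clustering(0) == 1.0           # node 0 closes its only neighbor pair
assert abs(clustering(2) - 1 / 3) < 1e-9
```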
K-core distribution. The ratio of nodes contained in a k-core of order k is defined in Eq. (6), where C_k denotes the set of nodes contained in the k-core of order k.

P(k) = |C_k| / N    (6)
A k-core is the maximal induced subgraph whose nodes all have degree at least k, as stated in Eq. (7), where G_k is a subgraph of G.

deg_{G_k}(v) ≥ k, for all v in G_k    (7)
The k-core is an important measure of connectivity, community structure and density/sparsity of the graph (Van Koevering et al. 2021). It is also correlated with the average node degree, as it is defined based on the node degree.

Adjacent edge sampling
For large complex graphs, such as social networks, there are many more edges than nodes. For real graphs stored as edge lists, the size of the graph data is proportional to the number of edges. Therefore, sampling based on the edge count, rather than the node count, generates a sample graph whose data size is proportional to the sampling ratio.
In our approach, an edge-based sampling approach is designed. This approach applies adjacent edge selection to keep neighboring edges and nodes during the sampling process. The sampling algorithm is described in Algorithm 1. Initially, adjacency lists of edges are precomputed for efficiently finding the adjacent edges during the sampling process. In each iteration, it randomly selects a batch of k edges with equal probability, and then adds the adjacent edges of these edges. The set of adjacent edges of an edge e is defined in Eq. (8). After that, the sample graph is generated from these edges. This algorithm is a one-pass approach.
Adj(e) = { e′ ∈ E : e′ ≠ e, e′ shares a node with e }    (8)
The time complexity of Algorithm 1 is O(|E|). It takes O(|E|) time to calculate the adjacency lists of all edges. Then, in each iteration, it adds k edges and their adjacent edges. It performs at most |E| edge selections and takes O(|E|) time in total. So, the final time complexity is O(|E|).

The adjacent edge sampling algorithm can preserve topology information locally, but it may lead to biased results during random edge sampling. This edge sampling approach generates subgraphs by selecting edges, and it is biased toward selecting high-degree nodes, as proved in Lemma 1.
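The adjacent edge sampling of Algorithm 1 can be sketched as follows. This is a simplified, non-streaming illustration; the function name and batch handling are our own, and the sample is truncated to the exact target size for simplicity:

```python
import random
from collections import defaultdict

def adjacent_edge_sampling(edges, ratio, k=2, seed=0):
    """Sketch of Algorithm 1: per iteration, pick a batch of k seed edges
    uniformly at random and add their adjacent edges (Eq. (8))."""
    rng = random.Random(seed)
    target = int(len(edges) * ratio)
    incident = defaultdict(list)          # node -> incident edge ids
    for i, (u, v) in enumerate(edges):
        incident[u].append(i)
        incident[v].append(i)
    sampled = set()
    while len(sampled) < target:
        for i in rng.sample(range(len(edges)), k):   # seed batch
            sampled.add(i)
            u, v = edges[i]
            sampled.update(incident[u])              # adjacent edges share u
            sampled.update(incident[v])              # or share v
    return [edges[i] for i in sorted(sampled)][:target]

# Usage on a toy path graph (hypothetical): sample half of the edges.
sample = adjacent_edge_sampling([(i, i + 1) for i in range(20)], ratio=0.5)
assert len(sample) == 10
```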
Lemma 1
Subgraph from adjacent edge sampling is biased to have a higher ratio of high-degree nodes.
Let p_k be the ratio of nodes with degree k in the original graph, and p̂_k be the ratio of nodes with original degree k in the sample graph. Let avgdeg be the average degree of nodes in the original graph G. Then E[p̂_k] > p_k for k > avgdeg.
Proof
Let N_k be the number of nodes with degree k in the original graph G, as defined in Eq. (9).
N_k = |{ v ∈ V : deg(v) = k }|    (9)
Ratio p_k is defined in Eq. (10), where N is the node count of the original graph.

p_k = N_k / N    (10)
The expected value of p̂_k can be calculated as in Eq. (11), where q_k is the probability of selecting a node with degree k, and n is the node count of the sample graph.

E[p̂_k] = (n · q_k) / n = q_k    (11)
Then it can be proved that E[p̂_k] > p_k for k > avgdeg.

The ratio of high-degree nodes with degree k > avgdeg is expected to be higher in the sample graph. This bias can balance the underestimation of node degrees in sampled subgraphs.
Adaptive edge sampling
Based on the adjacent edge sampling, an adaptive sampling approach is proposed. Adaptive edge sampling (AdapES) is based on the preset goal of matching the global graph property (i.e., the average node degree). In our approach, the average node degree difference between the sample graph and the original graph is measured, and the sampling process is dynamically adjusted based on this difference. In this approach, an edge is selected with a probability determined by the edge degree defined in Eq. (12) and a parameter α.
deg(e) = deg(u) + deg(v), for e = (u, v)    (12)
Specifically, an edge is randomly selected with a probability calculated in Eq. (13). Through dynamically changing α, the edge selection probability is updated; thus, the edge sampling process is adjusted.

p(e) = w(e) / Σ_{i=1}^{|E|} w(e_i)    (13)
Here |E| denotes the number of edges, and w(e) is the edge selection weight defined in Eq. (14). As the probability of selecting an edge is calculated based on the edge degree, high-degree edges have a higher selection probability when α > 0. As shown in these equations, the edge selection probability is determined by α.

w(e) = deg(e)^α    (14)
The adaptive edge sampling approach dynamically changes α to adjust the edge selection probability, in order to match the average node degree of the original graph. It is based on the correlation between α and the edge selection probability. As α is increased, the probability of selecting a high-degree edge is higher, as proved in Lemma 2.

Lemma 2
The edge selection probability used in adaptive edge sampling is an increasing function of α for high-degree edges.
Let p(e) be the edge selection probability for edge e. Then the derivative ∂p(e)/∂α > 0 for edge e if the edge degree deg(e) satisfies
deg(e) > exp( Σ_{i=1}^{|E|} p(e_i) ln deg(e_i) )    (15)
Proof
Let deg(e) be the degree of edge e defined in Eq. (12) for the original graph G. The selection probability p(e) for edge e is defined based on deg(e) and α in Eq. (16).
p(e) = deg(e)^α / Σ_{i=1}^{|E|} deg(e_i)^α    (16)
The derivative of p(e) with respect to α can be calculated as in Eq. (17).

∂p(e)/∂α = p(e) ( ln deg(e) - Σ_{i=1}^{|E|} p(e_i) ln deg(e_i) )    (17)
Then ∂p(e)/∂α > 0 if deg(e) satisfies Eq. (15).

A subgraph sampled with a larger α is biased toward a higher ratio of high-degree nodes than a subgraph sampled with a smaller α, and hence toward a larger average node degree. The correlation between α and the average node degree is observed and shown in Fig. 2 of Sect. 4.1.2.
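The direction stated in Lemma 2 can be checked numerically. The sketch below assumes a weight of the form w(e) = deg(e)^α normalized over all edges, which is our reading of Eqs. (14) and (16); the edge degrees are hypothetical:

```python
def selection_probs(degrees, alpha):
    """Selection probabilities under the assumed weight w(e) = deg(e)**alpha,
    normalized over all edges (our reading of Eq. (16))."""
    weights = [d ** alpha for d in degrees]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical edge degrees: one high-degree edge among low-degree ones.
degrees = [2, 3, 3, 10]
p_low = selection_probs(degrees, alpha=0.5)
p_high = selection_probs(degrees, alpha=1.5)

# Raising alpha increases the selection probability of the high-degree
# edge and decreases that of the lowest-degree edge, matching Lemma 2.
assert p_high[3] > p_low[3]
assert p_high[0] < p_low[0]
```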
Our adaptive edge sampling algorithm AdapES is shown in Algorithm 2. Initially, adjacency lists of edges and edge weights under different α values are precomputed. Edge weights over a range of α values are precomputed in Algorithm 3 for efficient edge weight lookup under a specific α in sampling iterations. AdapES selects edges iteratively. In each iteration, a batch of k edges is randomly selected with probabilities proportional to the edge weights calculated in Eq. (14), and then the adjacent edges of these edges are added, as defined in Eq. (8). After that, the average node degree avgdeg_s of the current subgraph is calculated. If avgdeg_s is larger than the average node degree avgdeg of the original graph, α is decreased to reduce the weight of high-degree edges in the next iteration, in order to select fewer high-degree edges and more low-degree edges for a lower average node degree globally. If avgdeg_s is smaller than avgdeg, α is increased to update the edge selection probability in the next iteration for a higher average node degree globally. From these selected edges, a sample graph is generated.
The value of α is constrained within a preset range [α_min, α_max]. To control the change rate of α, we use a step size parameter Δα, which is set to 0.25. Further, to avoid excessively frequent adjustment, a ratio threshold is used: α is updated only when the ratio of the average node degree difference is at or above the threshold θ, which is set to 0.02.
The values of these parameters are determined empirically and can be adjusted to tune the performance of the algorithm. The range of α should be sufficiently large to allow for changes. The ratio threshold θ also affects α changes and should be adjusted to tune the sensitivity of the adjustment. The step size Δα controls the change rate of α and should be set appropriately based on the range of α and the graph size, which determines the number of sampling iterations. Moreover, Δα should be chosen by considering the value of k, the number of batch edges initially selected in each iteration. Empirically, as fewer adjustment iterations are expected for a larger k, a larger step size is used to avoid a low change rate of α.
The time complexity of Algorithm 2 is O(|E| log|E|). The initial calculation of edge weights over a range of α values takes O(((α_max - α_min)/Δα) · |E|) time. As α_min, α_max and Δα are all constants, this time complexity is O(|E|), which is the time complexity of Algorithm 3. For the edge selection iterations, the algorithm performs at most |E| edge selections and adds at most |E| edges in total. Each edge selection takes O(log|E|) time to select an edge based on the edge weight by using the Python random module. The total selection time is O(|E| log|E|). So, the final time complexity of this algorithm is O(|E| log|E|).
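Putting the pieces together, a much-simplified sketch of the AdapES loop (Algorithm 2) might look like the following. The edge degree definition (deg(u) + deg(v)), the weight form deg(e)^α, the discrete α grid, and the parameter values here are illustrative assumptions rather than the paper's exact implementation:

```python
import random
from collections import Counter, defaultdict

def adapes_sketch(edges, ratio, diff_threshold=0.02,
                  alphas=(-2.0, -1.0, 0.0, 1.0, 2.0), k=2, seed=0):
    """Simplified AdapES loop: batch-select edges with probability
    proportional to an assumed weight deg(e)**alpha, add adjacent edges,
    then nudge alpha toward matching the original average node degree."""
    rng = random.Random(seed)
    deg = Counter()
    incident = defaultdict(list)          # node -> incident edge ids
    for i, (u, v) in enumerate(edges):
        deg[u] += 1; deg[v] += 1
        incident[u].append(i); incident[v].append(i)
    avg_deg = 2 * len(edges) / len(deg)   # sampling goal (Eq. (1))
    # Assumed edge degree deg(e) = deg(u) + deg(v); precompute weight
    # lists for each candidate alpha (the role of Algorithm 3).
    edge_deg = [deg[u] + deg[v] for u, v in edges]
    weights = {a: [d ** a for d in edge_deg] for a in alphas}
    target = int(len(edges) * ratio)
    a_idx = len(alphas) // 2              # start from the middle alpha
    sampled = set()
    while len(sampled) < target:
        a = alphas[a_idx]
        for i in rng.choices(range(len(edges)), weights=weights[a], k=k):
            sampled.add(i)
            u, v = edges[i]
            sampled.update(incident[u]); sampled.update(incident[v])
        # Compare the sample's average degree with the goal, adjust alpha.
        nodes = {x for i in sampled for x in edges[i]}
        cur = 2 * len(sampled) / len(nodes)
        if abs(cur - avg_deg) / avg_deg >= diff_threshold:
            if cur > avg_deg and a_idx > 0:
                a_idx -= 1                # favor low-degree edges next round
            elif cur < avg_deg and a_idx < len(alphas) - 1:
                a_idx += 1                # favor high-degree edges next round
    return [edges[i] for i in sorted(sampled)][:target]
```

As in the Algorithm 1 sketch, the result is truncated to the exact target edge count for simplicity.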
Implementation
In the proposed sampling approach, graph generation is implemented using NetworKit (Staudt et al. 2016). NetworKit is an open-source software package for analyzing the structure of large complex networks. The kernels are written in C++ while providing an interface in Python for performance and usability. Using this library can improve the performance of graph sampling and graph analytics.
Algorithm optimization. Adaptive edge sampling is implemented based on the adjacent edge sampling. Adjacency lists of edges are precomputed for efficiently finding adjacent edges in sampling iterations. Adaptive edge sampling dynamically adjusts the edge selection weights by changing α. For efficient lookup of the edge weights under a specific α in adjustment iterations, the edge weight lists over a range of α values are precomputed. Further, for edge selection, a batch of k edges is randomly selected using the Python itertools module for efficient looping. This adaptive edge sampling is a one-pass algorithm. By dynamically updating the adjacency lists, it can be applied to streaming graph sampling.
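The precomputation plus O(log|E|) weighted selection claimed above can be illustrated with cumulative weight lists and binary search; the degrees and α values below are hypothetical, and this is only a sketch of the idea, not the paper's code:

```python
import bisect
import itertools
import random

# For each candidate alpha, store a cumulative weight list once, so that a
# single weighted edge selection costs O(log |E|) by binary search.
# The edge degrees and alpha values below are hypothetical.
edge_degrees = [3, 1, 4, 1, 5, 9, 2, 6]
alphas = [-1.0, 0.0, 1.0]
cumulative = {
    a: list(itertools.accumulate(d ** a for d in edge_degrees))
    for a in alphas
}

def pick_edge(alpha, rng):
    """Select an edge id with probability proportional to deg(e)**alpha."""
    cum = cumulative[alpha]
    r = rng.random() * cum[-1]
    return bisect.bisect_right(cum, r)

rng = random.Random(42)
assert 0 <= pick_edge(1.0, rng) < len(edge_degrees)
```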
Graph data. For the edge sampling approaches, the input graph data is provided as a collection of edges. It can support different file formats such as txt and csv. To facilitate the graph analytics of graph properties and metrics, in this adaptive sampling approach AdapES, the generated sample graph is exported as a graph data file in a collection of edges, which is saved locally.
Evaluation
In this section, we evaluate our proposed approach by comparing the sampled graphs from our sampling algorithm AdapES and other sampling algorithms of various types as described in Sect. 4.2. These experimental evaluations are run on a machine with 2 sockets of Xeon E5-2650v4 12 core CPU and 128 GB DRAM memory. To compare the results on different graphs, we use 11 real-world graph datasets listed in Table 1 (Leskovec and Krevl 2014). For a sampling algorithm, distribution distances of graph properties between the sampled subgraphs over sampling ratios 0.1–0.5 and the original graph are evaluated.
Among these 11 datasets, there are graphs of social networks, collaboration networks and email networks from SNAP online network data (Leskovec and Krevl 2014), which are widely used in the evaluation of graph analytics (Hu et al. 2023; Jin et al. 2023; Rozemberczki et al. 2020; Jangda et al. 2021). These graphs are undirected without edge weights and self-loops. The datasets store the graph in the format of edge list. The basic statistics of the graph datasets can be found in Table 1.
Table 1. Graph Datasets
Graph | Vertices | Edges | Type |
|---|---|---|---|
FacebookPage | 22,470 | 171,002 | Page-page network |
LastFMAsia | 7,624 | 27,806 | Social network |
DeezerEurope | 28,281 | 92,752 | Social network |
GitHub | 37,700 | 289,003 | Social network |
EnronEmail | 36,692 | 183,831 | Email network |
Brightkite | 58,228 | 214,078 | Location social network |
Gowalla | 196,591 | 950,327 | Location social network |
AmazonProduct | 334,863 | 925,872 | Product network |
DBLP | 317,080 | 1,049,866 | Collaboration network |
Youtube | 1,134,890 | 2,987,624 | Social network |
LiveJournal | 3,997,962 | 34,681,189 | Social network |
Analysis of adaptive edge sampling
Adaptive edge sampling is based on adjacent edge sampling, an edge sampling approach. This approach samples edges by the ratio of the edge count and can generate a sampled graph of size proportional to the sampling ratio. During the sampling process, adaptive sampling dynamically adjusts the edge selection probability by applying different α values toward the sampling goal: matching the average node degree.
Sample graph size over sampling ratio
By sampling edges by the ratio of the edge count, the sample graph size is proportional to the sampling ratio, which reflects the ratio of the sample graph size to the full graph size. In Fig. 1, the AdapES algorithm is applied to generate sample graphs on these datasets. Here, the sample/full graph ratio is calculated as the ratio of the sample graph size at a given sampling ratio to the full graph size. It can be observed that this ratio is linear in the sampling ratio.
Fig. 1 [Images not available. See PDF.]
Correlation between sample graph size and sampling ratio
A sample graph is a portion of the original graph. As observed in Fig. 1, the sample/full graph ratio is smaller than the sampling ratio. On the EnronEmail, Brightkite and Gowalla datasets, this ratio is smaller than 1/2 of the sampling ratio, which results from the shorter vertex id strings in the sampled graph. For a sample graph at sampling ratio x, the sample graph size is smaller than x times the full graph size. Compared to graph analytics on full graphs, it is more efficient to conduct graph analytics on sample graphs, as the cost of graph computation is lower on smaller subgraphs.
Effect of α on matching average node degree
For the adaptive sampling approach, a sampling goal is set, and the sampling algorithm proceeds toward this goal. In the proposed adaptive edge sampling algorithm AdapES, this goal is to match the average node degree.
During the iterations of adaptive edge sampling, parameter α is increased or decreased to dynamically adjust the edge selection probability defined by the edge degree and α, in order to match the average node degree of the original graph. Figure 2 shows the trace of average node degree changes due to α adjustment in AdapES on the Youtube dataset (sampling ratio 0.2 is selected as a representative ratio). With the adjustment of α, the average node degree converges to a narrow range of 5.1-5.4 in a short period of time, and finally reaches 5.253, a tiny gap from the average node degree 5.265 of the original graph. As observed from this figure, there is a positive correlation between α and the average node degree. The average node degree goes up and down with the increase and decrease in α.
It can be observed from this figure that the values of α are dynamically adjusted in an efficient manner. In the initial stage, when the average node degree is far from the target value, α is updated frequently for fast convergence of the average node degree. After the average node degree converges to a narrow range, the adjustment becomes less frequent. Within this process, the appropriate selection of the step size and the threshold of the degree difference also contributes to the efficiency.
Fig. 2 [Images not available. See PDF.]
Average node degree and α trace in AdapES on Youtube dataset at sampling ratio 0.2
Comparison of sampling algorithms
To evaluate the sampling results, we compare the performance of our proposed algorithm with various sampling algorithms as listed in Table 2, including types of node selection-based sampling, edge selection-based sampling, and exploration-based sampling. Details of these algorithms are described as follows:
Table 2. Sampling Algorithms
Name | Algorithm | Type |
|---|---|---|
DaS | Degree-Biased Sampling | Edge |
PIES | Partially Induced Edge Sampling | Edge |
GSES | GraphSAINT Edge Sampling | Edge |
NS | Random Node Sampling | Node |
SubMix | Mix Subgraph Sampling | Exploration |
RW | Random Walk Sampling | Exploration |
AdjES | Adjacent Edge Sampling | Edge |
AdapES | Adaptive Edge Sampling | Edge |
DaS is degree-biased edge sampling. It randomly selects an edge with probability proportional to the edge weight calculated in Eq. (18), the minimum of the two vertex degrees, and generates the sample graph from these edges. This algorithm follows the idea of the MiDaS algorithm (Choe et al. 2022).
w(e) = min( deg(u), deg(v) ), for e = (u, v)    (18)
PIES is partially induced edge sampling. We implemented this approach based on the algorithm PIES (Ahmed et al. 2013), which creates a reservoir of the nodes from the stream and updates the reservoir by randomly replacing the existing nodes with new incident nodes. We modified this algorithm to sample at the sampling ratio calculated by the edge count instead of the node count.
GSES is GraphSAINT edge sampling (Zeng et al. 2020). It randomly selects edges with a probability calculated from the degrees of the connected nodes, as shown in Eq. (19), and generates the induced subgraph as the sample graph. The sampling ratio is calculated by the number of edges.
p(e) ∝ 1/deg(u) + 1/deg(v), for e = (u, v)    (19)
NS is random node sampling (Rozemberczki et al. 2020; Stumpf et al. 2005). It randomly selects nodes with equal probability and generates the induced subgraph as the sample graph. It is also modified to sample at the sampling ratio calculated by the edge count.
SubMix (Abu-El-Haija et al. 2023) is a tree-based sampling method. It randomly selects neighbors of the initial seed node and continues to select subsequent neighbors of previously sampled nodes. This approach performs level-wise sampling, mixing different samplers with component weights for sampling neighbors. In our evaluation, it mixes a PageRank sampler and a least-degree sampler (with the weight of one component set to 0) for sampling neighbors, with sampling budgets of 20 and 10. The PageRank sampler selects neighbors with probability proportional to their PageRank values, as shown in Eq. (20). Here and below, the number of sampling levels is set to 2.
p(v) ∝ PR(v)    (20)
The least-degree sampler selects neighbors with probability inversely related to their degree, as shown in Eq. (21), where d_max denotes the maximum node degree.

p(v) ∝ d_max - deg(v)    (21)
RW is random walk sampling (Rozemberczki et al. 2020; Gjoka et al. 2010). It starts from one random node, visits random neighboring nodes and adds all the edges during the visiting path, which are used to generate the sample graph.
AdjES is adjacent edge sampling, the basic algorithm of the proposed approach. During sampling iterations, it randomly selects a batch of seed edges with equal probability and adds the adjacent edges of the seed edges.
AdapES is adaptive edge sampling based on AdjES. In the edge selection, it randomly picks a batch of seed edges with probabilities proportional to the edge weight, calculated as a function of the edge degree and α, and adds the adjacent edges of these seed edges. This approach dynamically adjusts the edge selection probability by changing α, in order to match the average node degree of the original graph.
Evaluation metrics
In the experimental evaluation, the following graph properties and statistics are measured and compared between the sample graph and the original graph:
Degree distribution
Clustering coefficient distribution
K-core distribution
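These statistics can be computed directly from an adjacency list. The pure-Python sketch below (illustrative names and toy graph; production code would typically use a library such as NetworkX) shows one simple way:

```python
def degree_hist(adj):
    """Degree distribution as {degree: node count}."""
    hist = {}
    for v, ns in adj.items():
        hist[len(ns)] = hist.get(len(ns), 0) + 1
    return hist

def clustering(adj, v):
    """Local clustering coefficient: fraction of neighbor pairs that are linked."""
    ns = adj[v]
    d = len(ns)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d) if ns[j] in adj[ns[i]])
    return 2 * links / (d * (d - 1))

def core_numbers(adj):
    """K-core numbers via min-degree peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining, core, k = set(adj), {}, 0
    while remaining:
        v = min(remaining, key=lambda x: deg[x])
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```

On this toy graph, the triangle forms the 2-core while the pendant node has core number 1, matching the intuition behind the k-core distribution metric.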
D-statistic
The Kolmogorov-Smirnov (KS) D-statistic is widely used to measure the distance between distributions. As computed in Eq. (23), it is the maximum vertical distance between two CDFs. All sampling algorithms are evaluated using the D-statistics of the degree, clustering coefficient and k-core distributions to compare the distance between the sampled graphs and the original graph on all datasets. For each dataset, the D-statistic is calculated for each sampling algorithm at sampling ratios ranging from 0.1 to 0.5.
D = max_x |F₁(x) − F₂(x)|    (23)
where F₁ and F₂ are the CDFs of the original and sampled distributions.
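A direct implementation of the D-statistic over two empirical distributions might look like this (the two degree sequences below are hypothetical, not from the evaluation):

```python
def ks_d_statistic(xs, ys):
    """Eq. (23): maximum vertical distance between the empirical CDFs
    of two samples, e.g. two degree sequences."""
    support = sorted(set(xs) | set(ys))
    cdf = lambda data, t: sum(1 for x in data if x <= t) / len(data)
    return max(abs(cdf(xs, t) - cdf(ys, t)) for t in support)

# Hypothetical degree sequences of an original graph and a sampled subgraph.
orig = [1, 2, 2, 3, 3, 3, 4]
samp = [1, 2, 3, 3, 4]
d = ks_d_statistic(orig, samp)  # 0 means identical CDFs, 1 means disjoint support
```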
Figures 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13 show the D-statistics for the degree, clustering coefficient (clustering) and k-core distributions on the 11 datasets. Overall, AdapES achieves the best performance among all 8 algorithms for degree, clustering coefficient and k-core, and steadily captures these graph properties regardless of the sampling ratio. For 7 of the 11 datasets, AdapES outperforms the others in all metrics. For the other 4 datasets, it underperforms with small gaps: for clustering coefficient on the GitHub and EnronEmail (sampling ratios 0.3–0.5) datasets, for clustering coefficient and k-core (sampling ratios 0.2 and 0.3) on the DBLP dataset, and for degree on the AmazonProduct dataset (sampling ratios 0.1–0.3).
For the degree distribution, AdapES outperforms the other algorithms except SubMix on the AmazonProduct dataset, where the gaps are less than 0.05. The D-statistics of AdapES are less than 0.1 for all 11 datasets, and less than 0.05 for 8 of them: FacebookPage, DeezerEurope, GitHub, EnronEmail, Brightkite, Gowalla, DBLP and Youtube. For the FacebookPage and LastFMAsia datasets, the D-statistics are much higher at sampling ratio 0.1 than at the other sampling ratios. This is caused by the small sample graph, for which AdapES has fewer edge-selection iterations in which to adapt and preserve the node degrees.
For the clustering coefficient distribution, AdapES also outperforms the other algorithms. The D-statistics of AdapES are less than 0.3 for 10 datasets, the exception being GitHub. For the GitHub dataset, the D-statistic of AdapES is 0.43 at sampling ratio 0.1, drops to 0.24 at sampling ratio 0.2, and is lower at the other sampling ratios. This underperformance is caused by the smaller sample, for which AdapES has fewer edge-selection iterations in which to adapt and preserve degree-related graph properties such as the clustering coefficient. For the EnronEmail dataset, AdapES outperforms the others at sampling ratios 0.1 and 0.2, but NS is better at sampling ratios 0.3–0.5. On the DBLP dataset, AdjES and AdapES perform similarly, with small gaps (less than 0.03).
For the k-core distribution, AdapES has the best performance among all algorithms. The D-statistics of AdapES are less than 0.2 for 10 datasets, the exception being AmazonProduct, where the maximum D-statistic of AdapES is 0.28, still much smaller than those of the other algorithms. On the DBLP dataset, AdapES is outperformed by AdjES at sampling ratios 0.2 and 0.3, but the gaps are less than 0.06.
On the large datasets Gowalla, AmazonProduct, DBLP, Youtube and LiveJournal, AdapES performs better than on the smaller datasets. Especially for the Youtube dataset, its D-statistics are less than 0.01 for the degree distribution, 0.05 for the clustering coefficient distribution and 0.04 for the k-core distribution over all sampling ratios 0.1–0.5. For the Youtube dataset, NS, DaS and SubMix also perform well, with small gaps compared to AdapES, and PIES and RW perform similarly to them for the degree distribution.
Compared to AdjES, AdapES performs consistently better, which is attributable to the adaptive sampling mechanism. Only in a few cases does AdjES perform better: on the DBLP dataset, AdjES beats AdapES in clustering coefficient and k-core at sampling ratios 0.2 and 0.3, but the gaps are small (0.03 for clustering coefficient and 0.06 for k-core). DaS is overall outperformed by AdapES; it matches AdapES only for the degree distributions on the EnronEmail and Youtube datasets, and for the clustering coefficient distribution on the Youtube dataset. Similarly, NS outperforms AdapES only for the clustering coefficient distributions on the GitHub and EnronEmail (sampling ratios 0.3–0.5) datasets. SubMix outperforms AdapES only for the degree distribution on the AmazonProduct dataset (sampling ratios 0.1–0.3), while it outperforms GSES, which is likewise used for sampling in GNNs, because it captures more graph topology information when sampling neighbors.
Fig. 3 [Images not available. See PDF.]
FacebookPage dataset
Fig. 4 [Images not available. See PDF.]
LastFMAsia dataset
Fig. 5 [Images not available. See PDF.]
DeezerEurope dataset
Fig. 6 [Images not available. See PDF.]
GitHub dataset
Fig. 7 [Images not available. See PDF.]
EnronEmail dataset
Fig. 8 [Images not available. See PDF.]
Brightkite dataset
Fig. 9 [Images not available. See PDF.]
Gowalla dataset
Fig. 10 [Images not available. See PDF.]
AmazonProduct dataset
Fig. 11 [Images not available. See PDF.]
DBLP dataset
Fig. 12 [Images not available. See PDF.]
Youtube dataset
Fig. 13 [Images not available. See PDF.]
LiveJournal dataset
Distributions
To analyze the degree, clustering coefficient and k-core distributions in detail, cumulative distribution functions (CDFs) are plotted for the sample graphs generated by all algorithms and for the original graph (labeled as True) on the Youtube and LiveJournal datasets. Sampling ratio 0.2 is selected as a representative ratio. As shown in Figs. 14, 15, 16, 17, 18 and 19, the results on the Youtube and LiveJournal datasets demonstrate that AdapES outperforms the others for all distributions.
For the degree distribution on the Youtube dataset, as shown in Fig. 14, AdapES captures both high-degree and low-degree nodes. RW, NS and PIES overestimate the distribution by a small margin, while SubMix and DaS underestimate it; the others fail to preserve the degree distribution. For the clustering coefficient distribution, as shown in Fig. 15, AdapES preserves the distribution accurately. DaS captures the low clustering coefficients but overestimates the high clustering coefficients by a small margin, and NS overestimates the distribution. For the k-core distribution, as shown in Fig. 16, AdapES preserves the distribution more accurately than the other methods. RW and NS overestimate the distribution by a small margin, while SubMix and DaS underestimate it.
For the degree distribution on the LiveJournal dataset, as shown in Fig. 17, AdapES overestimates the distribution by a small margin; the others fail to preserve it. For the clustering coefficient distribution, as shown in Fig. 18, AdapES preserves the distribution accurately overall, while AdjES has better results for high clustering coefficients, and NS performs similarly to AdapES. For the k-core distribution, as shown in Fig. 19, AdapES preserves the distribution more accurately than the other methods.
Overall, AdapES samples the most representative subgraph for graph analytics on a sample of the whole graph. Compared to other sampling methods, AdapES can generate a smaller subgraph with high similarity to the whole graph, so graph analytics on the smaller sample can be more efficient with respect to graph sampling and processing overhead.
Fig. 14 [Images not available. See PDF.]
Degree distributions on Youtube dataset (sampling ratio = 0.2)
Fig. 15 [Images not available. See PDF.]
Clustering Coefficient distributions on Youtube dataset (sampling ratio = 0.2)
Fig. 16 [Images not available. See PDF.]
K-core distributions on Youtube dataset (sampling ratio = 0.2)
Fig. 17 [Images not available. See PDF.]
Degree distributions on LiveJournal dataset (sampling ratio = 0.2)
Fig. 18 [Images not available. See PDF.]
Clustering Coefficient distributions on LiveJournal dataset (sampling ratio = 0.2)
Fig. 19 [Images not available. See PDF.]
K-core distributions on LiveJournal dataset (sampling ratio = 0.2)
Sampling time
We measure the running times of all the sampling algorithms on the large datasets Youtube and LiveJournal. As shown in Fig. 20a and b, AdapES is faster than PIES and SubMix, but slower than the others. The time complexity of AdapES is higher than the O(|E|) time complexity of NS, RW and AdjES. Compared to DaS and GSES, which have the same time complexity but can complete the random sampling process in one batch, AdapES requires more sampling iterations to adapt to changes during sampling. This is the main limitation of the approach, and it can be alleviated through further algorithm optimization and a more efficient Python module implementation. Although AdapES is not the fastest, it is cost-efficient considering the outstanding results it achieves in preserving graph properties.
Fig. 20 [Images not available. See PDF.]
Sampling times of sampling algorithms on Youtube and LiveJournal datasets
Conclusion
In this paper, we propose an adaptive graph sampling framework and design and implement AdapES, an adaptive edge sampling algorithm based on that framework. By monitoring the sampling difference between the sampled subgraph and the original graph throughout the sampling process, AdapES automatically adjusts the edge sampling probability to match the average node degree, which is preset as the sampling goal. The idea of presetting a sampling goal and dynamically adjusting the sampling process contributes to the flexibility and effectiveness of the approach. Using 11 real-world datasets ranging from thousands to millions of vertices and edges, the experimental evaluation demonstrates that our approach preserves graph properties and statistics more accurately than other sampling algorithms of various types. On a subgraph sampled by this approach with high estimation accuracy, graph analytics can be more efficient. In future work, we plan to alleviate its limitation through further algorithm optimization and to extend the adaptive graph sampling approach to support distributed execution on a cluster of machines for significant performance improvement.
Author contributions
KW contributed to the work of conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review & editing, and visualization.
Declarations
Conflict of interest
The authors declare no competing interests.
References
Abu-El-Haija S, Fatemi B, Axiotis K, Bulut N, Gasteiger J, Dillon JV, Perozzi B, Bateni M (2023) Submix: learning to mix graph sampling heuristics. In: The 39th conference on uncertainty in artificial intelligence
Ahmed, NK; Neville, J; Kompella, R. Network sampling: from static to streaming graphs. ACM Trans Knowl Discov Data (TKDD); 2013; 8,
Ahmed NK, Duffield N, Neville J, Kompella R (2014) Graph sample and hold: a framework for big-graph analytics. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1446–1455
Alev VL, Lau LC (2020) Improved analysis of higher order random walks and applications. In: Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pp 1198–1211
Ben-Eliezer O, Eden T, Oren J, Fotakis D (2022) Sampling multiple nodes in large networks: beyond random walks. In: Proceedings of the fifteenth ACM international conference on web search and data mining, pp 37–47
Bera SK, Seshadhri C (2020) How to count triangles, without seeing the whole graph. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 306–316
Chen X, Tan H, Chen Y, He B, Wong WF, Chen D (2021a) ThunderGP: HLS-based graph processing framework on FPGAs. In: The 2021 ACM/SIGDA international symposium on field-programmable gate arrays, pp 69–80
Chen, Y; Huang, S; Zhao, L; Dissanayake, G. Cramér-rao bounds and optimal design metrics for pose-graph slam. IEEE Trans Robot; 2021; 37,
Choe M, Yoo J, Lee G, Baek W, Kang U, Shin K (2022) Midas: representative sampling from real-world hypergraphs. In: Proceedings of the ACM web conference, pp 1080–1092
Cong W, Forsati R, Kandemir M, Mahdavi M (2020) Minimal variance sampling with provable guarantees for fast training of graph neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1393–1403
Fan, W. Big graphs: challenges and opportunities. Proc VLDB Endow; 2022; 15,
Fan, W; He, T; Lai, L; Li, X; Li, Y; Li, Z; Qian, Z; Tian, C; Wang, L; Xu, J et al. Graphscope: a unified engine for big graph processing. Proc VLDB Endow; 2021; 14,
Gao, H; Liu, Y; Ji, S. Topology-aware graph pooling networks. IEEE Trans Pattern Anal Mach Intell; 2021; 43,
Gjoka M, Kurant M, Butts CT, Markopoulou A (2010) Walking in facebook: a case study of unbiased sampling of osns. In: 2010 Proceedings IEEE Infocom. IEEE, pp 1–9
Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp 599–613
Gove, R. A random sampling O (n) force-calculation algorithm for graph layouts. Computer graphics forum; 2019; Wiley Online Library: pp. 739-751.
Hagberg A, Swart P, Chult DS (2008) Exploring network structure, dynamics, and function using NetworkX. Tech. rep., Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Hoang, L; Dathathri, R; Gill, G; Pingali, K. Cusp: a customizable streaming edge partitioner for distributed graph analytics. ACM SIGOPS Oper Syst Rev; 2021; 55,
Hong SH, Lu S (2020) Graph sampling methods for big complex networks integrating centrality, k-core, and spectral sparsification. In: Proceedings of the 35th annual ACM symposium on applied computing, pp 1843–1851
Hu, Z; Zheng, W; Lian, X. Triangular stability maximization by influence spread over social networks. Proc VLDB Endow; 2023; 16,
Imola J, Murakami T, Chaudhuri K (2022) Communication-Efficient triangle counting under local differential privacy. In: 31st USENIX security symposium (USENIX Security 22), pp 537–554
Jangda A, Polisetty S, Guha A, Serafini M (2021) Accelerating graph sampling for graph machine learning using GPUs. In: Proceedings of the sixteenth European conference on computer systems, pp 311–326
Jin, T; Li, B; Li, Y; Zhou, Q; Ma, Q; Zhao, Y; Chen, H; Cheng, J. Circinus: fast redundancy-reduced subgraph matching. Proc ACM Manag Data; 2023; 1,
Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 631–636
Leskovec J, Krevl A (2014) SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Li Y, Wu Z, Lin S, Xie H, Lv M, Xu Y, Lui JC (2019) Walking with perception: efficient random walk sampling via common neighbor awareness. In: 2019 IEEE 35th international conference on data engineering (ICDE), IEEE, pp 962–973
Liu P, Benson AR, Charikar M (2019) Sampling methods for counting temporal motifs. In: Proceedings of the twelfth ACM international conference on web search and data mining, pp 294–302
Mariappan M, Che J, Vora K (2021) Dzig: sparsity-aware incremental processing of streaming graphs. In: Proceedings of the sixteenth European conference on computer systems, pp 83–98
Nakajima K, Shudo K (2022) Social graph restoration via random walk sampling. In: 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, pp 1–14
Nguyen D, Lenharth A, Pingali K (2013) A lightweight infrastructure for graph analytics. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, pp 456–471
Pandey P, Wheatman B, Xu H, Buluc A (2021) Terrace: a hierarchical graph container for skewed dynamic graphs. In: Proceedings of the 2021 international conference on management of data, pp 1372–1385
Preti, G; De Francisci, MG; Riondato, M. Maniacs: approximate mining of frequent subgraph patterns through sampling. ACM Trans Intell Syst Technol; 2023; 14,
Rozemberczki B, Kiss O, Sarkar R (2020) Little ball of fur: a python library for graph sampling. In: Proceedings of the 29th ACM international conference on information & knowledge management, pp 3133–3140
Sahu, S; Mhedhbi, A; Salihoglu, S; Lin, J; Özsu, MT. The ubiquity of large graphs and surprising challenges of graph processing: extended survey. VLDB J; 2020; 29, pp. 595-618. [DOI: https://dx.doi.org/10.1007/s00778-019-00548-x]
Sakr, S; Bonifati, A; Voigt, H; Iosup, A; Ammar, K; Angles, R; Aref, W; Arenas, M; Besta, M; Boncz, PA et al. The future is big graphs: a community view on graph processing systems. Commun ACM; 2021; 64,
Shin, K; Oh, S; Kim, J; Hooi, B; Faloutsos, C. Fast, accurate and provable triangle counting in fully dynamic graph streams. ACM Trans Knowl Discov Data (TKDD); 2020; 14,
Staudt, CL; Sazonovs, A; Meyerhenke, H. Networkit: a tool suite for large-scale complex network analysis. Netw Sci; 2016; 4,
Stumpf, MP; Wiuf, C; May, RM. Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc Natl Acad Sci; 2005; 102,
Swift, IP; Ebrahimi, S; Nova, A; Asudeh, A. Maximizing fair content spread via edge suggestion in social networks. Proc VLDB Endow; 2022; 15,
Tan Q, Zhang J, Yao J, Liu N, Zhou J, Yang H, Hu X (2021) Sparse-interest network for sequential recommendation. In: Proceedings of the 14th ACM international conference on web search and data mining, pp 598–606
Tětek J, Thorup M (2022) Edge sampling and graph parameter estimation via vertex neighborhood accesses. In: Proceedings of the 54th annual ACM SIGACT symposium on theory of computing, pp 1116–1129
Trolliet, T; Cohen, N; Giroire, F; Hogie, L; Pérennes, S. Interest clustering coefficient: a new metric for directed networks like twitter. J Complex Netw; 2022; 10,
Van Koevering K, Benson A, Kleinberg J (2021) Random graphs with prescribed k-core sequences: a new null model for network analysis. In: Proceedings of the web conference, pp 367–378
Wan, C; Li, Y; Li, A; Kim, NS; Lin, Y. Bns-gcn: efficient full-graph training of graph convolutional networks with partition-parallelism and random boundary node sampling. Proc Mach Learn Syst; 2022; 4, pp. 673-693.
Yang, C; Buluç, A; Owens, JD. Graphblast: a high-performance linear algebra-based graph framework on the gpu. ACM Trans Math Softw (TOMS); 2022; 48,
Yang K, Zhang M, Chen K, Ma X, Bai Y, Jiang Y (2019) Knightking: a fast distributed graph random walk engine. In: Proceedings of the 27th ACM symposium on operating systems principles, pp 524–537
You J, Leskovec J, He K, Xie S (2020) Graph structure of neural networks. In: International conference on machine learning. PMLR, pp 10,881–10,891
Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V (2020) Graphsaint: graph sampling based inductive learning method
Zhang, Z; Liu, Q; Hu, Q; Lee, CK. Hierarchical graph transformer with adaptive node sampling. Adv Neural Inf Process Syst; 2022; 35, pp. 21171-21183.
Zhao, Y; Jiang, H; Qin, Y; Xie, H; Wu, Y; Liu, S; Zhou, Z; Xia, J; Zhou, F et al. Preserving minority structures in graph sampling. IEEE Trans Vis Comput Graph; 2020; 27,
Zheng C, Zong B, Cheng W, Song D, Ni J, Yu W, Chen H, Wang W (2020) Robust graph representation learning via neural sparsification. In: International conference on machine learning. PMLR, pp 11458–11468
Zhu Z, Wu K, Liu Z (2023) Arya: arbitrary graph pattern mining with decomposition-based sampling. In: 20th USENIX symposium on networked systems design and implementation (NSDI 23), pp 1013–1030
Copyright Springer Nature B.V. Dec 2024