Content area
Graph summarization is a well-established problem in large-scale graph data management. Its goal is to produce a summary graph, which is a coarse-grained version of a graph, whose use in substitution for the original graph enables downstream task execution and query processing at scale. Despite the extensive literature on graph summarization, still nowadays query processing on summary graphs is accomplished by either reconstructing the original graph, or in a query-specific manner. No general methods exist that operate on the summary graph only, with no graph reconstruction. In this paper, we fill this gap, and study for the first time general-purpose (approximate) query processing on summary graphs. This is a new important tool to support data-management tasks that rely on scalable graph query processing, including social network analysis. We set the stage of this problem, by devising basic, yet principled algorithms, and thoroughly analyzing their peculiarities and capabilities of performing well in practice, both conceptually and experimentally. The ultimate goal of this work is to make researchers and practitioners aware of this so-far overlooked problem, and define an authoritative starting point to stimulate and drive further research.
Introduction
Graphs, which are sets of entities (vertices) linked to one another (via edges), have become a ubiquitous real-world data-representation model (Aggarwal and Wang 2010a). In many domains, graphs are so large that it is hard to directly process them in their raw form. For instance, modern social networks count billions or even more vertices and edges. Thus, the problem of reducing the input graph—so as to make graph data management less time/space-demanding without changing algorithms for downstream query processing/task execution—has received considerable attention from researchers, practitioners, and in the industry (Besta and Hoefler 2018; Liu et al. 2018). Generating more compact versions of a big graph has been addressed from multiple perspectives. The problem that is usually termed graph compression (though there is no uniformity in the nomenclature in the literature) aims at developing ad-hoc data structures (and algorithms to manipulate them) to store and retrieve an exact representation of a graph using the minimum possible space (Besta and Hoefler 2018). Graph sparsification (a.k.a., graph simplification, or graph backboning) reduces a graph by removing edges/vertices (Coscia and Neffke 2017; Fung et al. 2019; Zeng et al. 2021). Graph summarization produces a coarse-grained version of the original graph—a summary graph—by grouping vertices and edges into supernodes and superedges (Beg et al. 2018; Hajiabadi et al. 2021; Khan et al. 2015; Ko et al. 2020; Kumar and Efstathopoulos 2018; Lee et al. 2020; LeFevre and Terzi 2010; Liu et al. 2014; Navlakha et al. 2008; Riondato et al. 2017; Shin et al. 2019; Toivonen et al. 2011; Yong et al. 2021) (see Fig. 1).1 Each of these graph-reduction approaches comes with its own pros and cons, effectiveness, and suitability, depending on the application scenario at hand. Graph summarization is mainly appealing as, unlike graph compression, it does not require to employ algorithms (and corresponding software) that have been developed ad-hoc for handling the data structures of a particular compression technique. Also, unlike graph sparsification, it guarantees that each vertex of the original graph forms a part of the output summary graph (through the corresponding supernode).
All these peculiarites make graph summarization a problem particularly appealing for the social-network context, where graphs to be processed are big, and keeping handling (a size-reduced version of) them with standard methodologies and data structures is highly desirable for purposes of generality and software reusability. This is eagerly required, for example, in social-network analysis, which heavily relies on well-established and acknowledged tools.
Fig. 1 [Images not available. See PDF.]
Illustration of graph summarization. In
Motivation. Designing methods for producing summary graphs has been an extremely active research area. A plethora of graph-summarization methods exist, based on various design principles, and targeting different graph types and applications; see Sect. 2.
Despite this extensive literature on graph summarization, an aspect that has received limited attention is how to use graph summaries for effective and efficient approximate query processing. Existing general-purpose query-processing methods on summary graphs (methods that depend on no specific query type) reconstruct on-the-fly the input graph (Hajiabadi et al. 2021; Lee et al. 2022; Kang et al. 2022a, b; Shin et al. 2019). Conversely, existing methods that exploit the summary graph only (without reconstructing the input graph) are query-specific (Hajiabadi et al. 2021; LeFevre and Terzi 2010; Riondato et al. 2014, 2017). To the best of our knowledge, no query-processing method on summary graphs exist that are general-purpose and use the summary graph only.
Contributions
In this paper, we fill this gap and study, for the first time, the problem of general-purpose (approximate) query-processing on summary graphs (
With these design principles in place, we devise two methods for
We remark that our
Conversely, we point out that providing a comprehensive comparative evaluation of existing methods (summary-graph–construction methods, graph-query-processing methods, etc.) is not our focus.
Benefits of this work include: (1) bringing to the attention of the graph-data-management community
In general, we believe that setting the stage of
Summary and roadmap. Our main contributions are as follows:
We study for the first time the problem of general-purpose (approximate) query processing on summary graphs (
GPQPS ), with the main goal of setting its stage—mainly conceptually and experimentally—and defining a principled starting point on it for researchers and practitioners (Sect. 3).We devise simple algorithms for
GPQPS , which are intended to be a reference for the problem and constitute a basic bar, to be raised by future work (Sect. 4).We set up an evaluation methodology that constitutes a benchmark testbed for this and future
GPQPS studies (Sect. 5).We perform extensive experiments, to derive insights on the practical impact and obstacles to an effective employment of the
GPQPS methods (Sects. 6–7).We provide nontrivial directions for further research (Sect. 8). In this regard, we remark that such directions are not as high-level as the usual ones of scientific papers; rather, they are a concrete yet highly specific roadmap that we could draw thanks to all the previously listed contributions.
Preliminaries and related work
The problem of graph summarization (Liu et al. 2018) takes as input a graph that, in the general case, is assumed to be directed and edge weighted. Formally, we are given a triple , where V is a set of vertices, is a set of edges, (i.e., ordered pairs of vertices), and is a function assigning a positive real-valued weight to every edge. Should G be undirected, E is defined as a set of unordered vertex pairs. Should G be unweighted, then , . A graph may alternatively, yet equivalently, be represented with its adjacency matrix, where , if , , otherwise, .
Graph summarization relies on the key notion of summary graph:
Definition 1
(Summary graph) A summary graph (or, simply, a summary) of a given graph is a directed and (possibly) edge-weighted graph , with vertices , edges , and edge-weighting function , such that every vertex is assigned to one and only one : , , . We term vertices and edges of a summary supernodes and superedges, respectively. The supernode to which a vertex belongs is denoted by . and denote the edges and the overall edge weight between all the vertices of supernodes S and T, respectively.
Simply speaking, a summary of a graph represents a partition of the vertices in V into supernodes. Superedges are pairs (S, T) of supernodes, which are unordered if the original graph is undirected, and ordered otherwise. Superedges may possibly be assigned a real-valued weight . However, a summary can simply be unweighted (Kang et al. 2022a). This case is modeled by setting , for all . Summary graphs admit self-loops, that is, superedges of the form (S, S). In general, a superedge (S, T), along with its possible weight , is meant to concisely represent the information about the various edges in G that connect a vertex in S and some other vertex in T.
From a summary we can derive the so-called reconstructed graph:
Definition 2
(Reconstructed graph) Given a summary of a graph G, the reconstructed graph is a graph that is defined by properly exploiting .
The definition of reconstructed graph is voluntarily left general here, as it depends on the specific graph-summarization method. Broadly speaking, the problem of graph summarization consists in finding a summary of a given graph such as to optimize some criteria that are typically aimed at both (1) minimizing the difference between the original graph and the reconstructed graph, and (2) keeping the size of graph summarization’s output low (Liu et al. 2018). Specific formulations of graph summarization fall into two main classes: size-driven and utility-driven, which we overview below.
Figure 1 illustrates the notions in Definitions 1 and 2.
Lossless vs. lossy graph summarization
Regardless of the formulation, graph summarization can be either lossless or lossy. In the first case, the original graph can be recovered exactly from the output of graph summarization. To guarantee such an exactness, several graph-summarization methods yield a correction set together with a summary graph (Ko et al. 2020; Navlakha et al. 2008; Shin et al. 2019; Yong et al. 2021), that is, edges to be added to or removed from the reconstructed graph so as to make it actually correspond to the input graph.
Conversely, a lossy graph-summarization output is not required to allow for exactly recovering the input graph. As such, lossy graph summarization typically outputs the summary only, with no correction set. In this work, we focus on lossy graph summarization. Thus, we will not address the notion of correction set further.
Size-driven graph summarization
Let be an error function that quantifies how good a summary is for a graph G, in terms of the difference between the reconstructed graph and G. Given an integer , size-driven graph summarization looks for a summary with supernodes that minimizes (Beg et al. 2018; Lee et al. 2020; LeFevre and Terzi 2010; Liu et al. 2014; Riondato et al. 2014, 2017; Toivonen et al. 2011). Existing approaches of size-driven graph summarization differ from each other in the specific error function employed.
The -reconstruction-error is a popular error function (Beg et al. 2018; LeFevre and Terzi 2010; Riondato et al. 2014, 2017). Let and be the probability of existence and the average weight of an edge between supernodes S and T, respectively:
1
The lifted adjacency matrix given is defined as a matrix whose cell contains the expected weight of an edge between the supernodes of u and v: . The -reconstruction error is defined as the entry-wise p-norm of the difference between the adjacency matrix M of the input graph and , that is,LeFevre and Terzi (2010) deal with and propose a greedy heuristic algorithm that resembles an agglomerative hierarchical clustering using Ward’s method (Ward 1963). Riondato et al. (2014, 2017) establish a connection between -based graph summarization and -clustering, and design algorithms with constant-factor approximation guarantees for and . Among Riondato et al.’s algorithms, the
Beg et al. (2018) and Lee et al. (2020) develop techniques that are mainly focused on improving the efficiency of the aforementioned algorithms.
Other error functions adopted in the literature include cut-norm error (Riondato et al. 2017), representation error (Liu et al. 2014), and path-based error (Toivonen et al. 2011; Zhou et al. 2017).
Utility-driven graph summarization
This is the dual formulation of the size-driven counterpart: given a utility function, which expresses how close the graph reconstructed from summary is to the input graph G, find a summary of minimum size, subject to the constraint that is no less than a given threshold (Hajiabadi et al. 2021; Khan et al. 2015; Ko et al. 2020; Kumar and Efstathopoulos 2018; Navlakha et al. 2008; Shin et al. 2019; Yong et al. 2021). Navlakha et al. (2008) adopt a utility function based on the representation error (see above). Shin et al. (2019) introduce
Summarizing weighted/directed graphs
Graph-summarization methods typically handle unweighted and undirected graphs. A few methods (can be easily adapted to) work for edge-weighted graphs (LeFevre and Terzi 2010; Riondato et al. 2014, 2017; Toivonen et al. 2011; Zhou et al. 2017). As for directed graphs, to the best of our knowledge, only Riondato et al. (2017) handle them: they propose to decompose the input adjacency matrix into the sum of a symmetric matrix and a skew-symmetric matrix, and compute a summary for each of such matrices. The resulting summaries still come with (constant-factor) approximation guarantees, as the theoretical properties of their algorithms carry over to skew-symmetric matrices.
Query processing on summary graphs
A graph summary corresponds to a succinct representation of a graph (even though possibly lossy). As such, a major use of a summary graph is to speedup query processing on graphs.
Existing general-purpose (approximate) query-processing methods (methods that depend on no specific query type) assume that the basic primitive for graph queries corresponds to retrieving the neighborhood of a vertex. Based on this, such methods reconstruct on-the-fly a neighborhood from the summary when processing the target query (Hajiabadi et al. 2021; Lee et al. 2022; Kang et al. 2022a, b; Shin et al. 2019). This strategy is beneficial for space complexity, as it keeps in memory only the summary. However, it gives no speedup in runtime, as reconstructed neighborhoods are typically larger than actual neighborhoods, at least on average.2
Conversely, existing methods to process graph queries using the summary only (without reconstructing the input graph) are ad-hoc defined for a few specific queries, for instance, degree (LeFevre and Terzi 2010; Riondato et al. 2014, 2017), triangle counting (Riondato et al. 2014, 2017), eigenvector centrality (LeFevre and Terzi 2010; Riondato et al. 2014, 2017), or, in the case of lossless summaries, PageRank and shortest path (Hajiabadi et al. 2021).
To the best of our knowledge, the problem of general-purpose query-processing on summary graphs without reconstructing the input graph has never been tackled. In this paper, we fill this gap.
Other (marginally) related literature
Experimental evaluationsKang et al. (2022a) evaluate how useful are weights on superedges of summary graphs. They focus on and the size of the reconstructed graph, as well as PageRank and vertex-proximity queries. Unlike our work, those queries are evaluated by reconstructing on-the-fly the input graph from the summary.
Besta et al. (2019) devise a programming model, a framework, and a query-processing evaluation for what they term lossy graph compression. By this term they mean “any scheme that removes some parts of graphs,” thus, something that goes beyond graph summarization. Among the techniques tested by Besta et al., there is one graph-summarization method, specifically
Other types of summary Summary graphs may have a form other than the one in Definition 1: they can describe a graph in terms of a given “vocabulary” of subgraph structures (e.g., stars, cliques, chains) (Koutra et al. 2014), or have a hierarchical structure (Lee et al. 2022). Those alternative types of summary are out of the scope of this work.
Problem variants of vanilla graph summarization include meta-graph summarization, whose goal is to apply meta-methods on top of basic graph-summarization methods, so that the resulting summaries exhibit properties that do not otherwise hold (e.g., vertex degree preservation in superedges (Zhou et al. 2021)); personalized graph summarization (Kang et al. 2022b), where summaries are tailored to a given set of target vertices; or summarization of more complex types of graph, such as dynamic/temporal graphs (Gou et al. 2019; Jiang et al. 2023; Tsalouchidou et al. 2020), or multi-relation graphs (Ke et al. 2022).
Query-specific graph summarization All the discussions so far are about summaries that are query-agnostic, that is, not tailored to any specific queries. There exist query-specific graph-summarization approaches too, for queries such as reachability, distance, neighborhood, community membership (Fan et al. 2012, 2021, 2022; Hernández and Navarro 2014; Maserrat and Pei 2010; Sadri et al. 2017). Query-specific graph summarization is not a focus of this work.
Related problems There exist problems that are similar in spirit to graph summarization, while still being different. The one that is usually termed graph compression (though there is no uniformity in the nomenclature in the literature) is concerned with developing data structures (and algorithms to manipulate them) to store (and retrieve) an exact representation of a graph using the minimum possible space (Besta and Hoefler 2018; Boldi et al. 2009; Boldi and Vigna 2004). A major difference with respect to graph summarization is that graph compression is not general: one needs to stick to the algorithms (and corresponding software) that have been ad-hoc developed for handling the data structures of that particular compression technique. Further differences include the fact that, typically, graph compression is lossless and operates at a lower level of graph representation (e.g., at a bit-level).
Graph sparsification (or graph simplification, or graph backboning) consists in reducing a graph by removing edges/vertices. As such, graph sparsification differs from graph summarization as it may completely discard parts of the graph in its output, while graph summarization guarantees that at least every vertex is part of the output summary. Another difference is that, although a few general approaches exist (Coscia and Neffke 2017; Serrano et al. 2009; Slater 2009), graph sparsification is mostly tailored to preserve specific properties, such as shortest paths, degree distribution, or spectral properties (Fung et al. 2019; Ahn et al. 2012; Spielman and Teng 2011; Zeng et al. 2021; Zhou et al. 2010).
Graph clustering aims at finding groups of vertices with high intra-cluster connectivity and inter-cluster separation (Aggarwal and Wang 2010b; Schaeffer 2007). This differs from graph summarization, which groups vertices with similar connection patterns with the rest of the graph.
Problem statement
Graph queries A (graph) queryQ is a computable function whose input is a graph and (query) context, and whose output is an object from a certain domain .3 Context identifies the complementary input of the query. It may correspond to pairs or sets of vertices, subgraphs, functions, numerical values, and so on, or it can also be empty. The output of a query Q (with context ) on a graph G is denoted by , and is alternatively termed the answer of Q on G. The answer of a query can be a Boolean, a numerical value, a set of vertices, a subgraph, a partition of the vertices, and so on. For instance, global queries computing numerical statistics on G (e.g., number of triangles, clustering coefficient, diameter) have and . Node embedding queries take a vertex u as a context , and output a d-dimensional numerical vector representing the embedding of u (thus, ). Inner-most core queries have , and the output is the vertices of the k-core (Batagelj and Zaversnik 2011) of G with the highest k (thus, ). Reachability queries (in directed graphs) take an ordered vertex pair (u, v) as a context , and output a Boolean stating whether v is reachable from u (thus, ). In top-ranked centrality queries, corresponds to an integer s, and the output is the top-s—ranked vertices according to a certain centrality (thus, ). In community detection queries, either contains the parameters of the algorithm to be used (e.g., number of communities), or is empty if the algorithm at hand is parameterless, and the output is a partition of V (thus, , where denotes the set of all possible partitions of V). In this work, we restrict our study to query answers that are either numerical or sets/partitions of vertices (i.e., ). Adaptation of our methodologies to other forms of answer is possible with relatively low effort. Anyway, we defer a more rigorous study of this to future work.
Answering queries from summaries We focus on a scenario where the answer to a query is approximated by exploiting solely a summary of a graph G, without accessingG at all. Also, we require query processing to be agnostic of both the specific query and the graph-summarization technique that has produced . Specifically, we are interested in what we term summary-based approximate query answers:
Definition 3
(Summary-based approximate query answer) Given a graph G, a summary of G, and a query Q on G with context , a summary-based approximate answer to Q on —denoted —is an approximation of obtained by making use only of .
GPQPSproblem With the above notation and discussion in place, we can now state the problem that we tackle in this work:
Problem 1
(
Simply speaking, Problem 1 asks for summary-based query answers which approximate well the true answer to the given query.
Algorithms
Naïve-GPQPSalgorithm As a very first, simple method for Problem 1 we consider the one where a query Q is processed on summary as if it were a normal graph, with the only precaution of letting each vertex u in the input graph G conceptually be identified with the supernode of it belongs to, and vice versa. To be more precise, let us introduce the following notions:
Definition 4
(Summary-aware query context) Given context of a query Q on a graph G, the summary-aware query context of on a summary of G—denoted by —is a copy of where every vertex is replaced with the supernode of it belongs to.
Definition 5
(Summary-processed query answer) Let Q be a query on a graph G with context , be a summary of G, and be the summary-aware context of on . The summary-processed answer to Q on —denoted by —is the answer obtained by processing Q on with context as if were a normal graph.
The simple method considered here is termed
Example 1
Consider a clustering-coefficient query Q on a graph G, i.e., a query whose output is a scalar numerical value corresponding to the average clustering coefficient over all the vertices of G. For this query, , and . The execution of Algorithm 1 on a summary of G for this clustering-coefficient query Q is as follows. First, summary-aware context information is computed from and . As , as well. Then, summary-processed query answer is computed. corresponds to the average clustering coefficient over all the supernodes of . Finally, the ultimate output of the algorithm is exactly equal to , as the query at hand is numerical and .
Probabilistic-GPQPSalgorithm The probabilistic interpretation of, among others, (cf. Sect. 2) suggests to model a summary as an uncertain (or probabilistic) graph, that is, a graph whose edges are assigned a probability of existence:
Definition 6
(Uncertain graph) An uncertain (or probabilistic) graph is a triple , where V is a set of vertices, is a set of edges, and is a function assigning existence probabilities to edges. According to the possible-world semantics (Abiteboul et al. 1987; Dalvi and Suciu 2004), an uncertain graph is interpreted as a set of deterministic graphs (worlds), each defined by a subset of E. Assuming independence among edge probabilities (Khan et al. 2018), the probability of observing any possible world drawn from is .
Our second
Example 2
Consider again the clustering-coefficient query Q of Example 1. The execution of Algorithm 2 on a summary of G for this clustering-coefficient query Q is as follows. First, possible worlds are sampled from . Then, query answer is computed on every world , by running Algorithm 1. Finally, the ultimate output of the algorithm is equal to the average over all the various ’s, as the query at hand is numerical.
Experimental methodology
Datasets We experiment with public datasets, most of which are traditionally used in the graph-summarization literature (Kang et al. 2022b; Ko et al. 2020; Lee et al. 2020; Riondato et al. 2017; Shin et al. 2019), and whose characteristics are reported in Table 1. Regarding the ones composed of multiple connected components (
Table 1. Characteristics of the selected datasets (|V|: #vertices; |E|: #edges) and summaries (: #supernodes; : #superedges; #CCs: #connected components; : size of the giant connected component (%); #self: #self-loops; Cpr.: compression defined as (%))
Characteristics | S2L Summaries | SWeG Summaries | |||||||
|---|---|---|---|---|---|---|---|---|---|
350 | 500 | 750 | 708 | 977 | 1168 | ||||
|V|: 4,039 | 8462 | 14,390 | 24,327 | 664 | 2157 | 4448 | |||
|E|: 88,234 | #CCs | 1 | 1 | 1 | #CCs | 240 | 18 | 38 | |
Undirected | 100% | 100% | 100% | 20% | 77% | 79% | |||
Unweighted | #self | 264 | 280 | 323 | #self | 11 | 22 | 36 | |
Cpr. | 86% | 79% | 68% | Cpr. | 94% | 92% | 90% | ||
500 | 750 | 1000 | 3375 | 3568 | 3821 | ||||
|V|: 7429 | 6835 | 10,350 | 12,768 | 1217 | 1550 | 1997 | |||
|E|: 27788 | #CCs | 1 | 1 | 1 | #CCs | 2169 | . 2076. | 1966 | |
Undirected | 100% | 100% | 100% | 1% | 11% | . 26% | |||
Unweighted | #self | 133 | 176 | 185 | #self | 6 | 10 | 17 | |
Cpr. | 58% | 47% | 40% | Cpr. | 66% | 64% | 62% | ||
1000 | 1500 | 2000 | 22,832 | 23,957 | 24,371 | ||||
|V|: 33,696 | 50,872 | 68,405 | 81,444 | 10,160 | 28,825 | 44,086 | |||
|E|: 180,811 | #CCs | 1 | 1 | 1 | #CCs | 14,537 | 7159 | 6056 | |
Undirected | 100% | 100% | 100% | 19% | 64% | 70% | |||
Unweighted | #self | 231 | 303 | 397 | #self | 557 | 702 | 749 | |
Cpr. | 68% | 61% | 54% | Cpr. | 69% | 60% | 52% | ||
500 | 750 | 1000 | – | – | – | ||||
|V|: 6301 | 2016 | 3383 | 5000 | – | – | – | |||
|E|: 20,777 | #CCs | 1 | 1 | 1 | #CCs | – | – | – | |
Directed | 100% | 100% | 100% | – | – | – | |||
Unweighted | #self | 5 | 6 | 12 | #self | – | – | – | |
Cpr. | 60% | 52% | 45% | Cpr. | – | – | – | ||
350 | 500 | 750 | – | – | – | ||||
|V|: 3024 | 35,346 | 55,276 | 82,589 | – | – | – | |||
|E|: 144,094 | #CCs | 1 | 1 | 1 | #CCs | – | – | – | |
Undirected | 100% | 100% | 100% | – | – | – | |||
Weighted | #self | 67 | 86 | 110 | #self | – | – | – | |
Cpr. | 74% | 60% | 41% | Cpr. | – | – | – | ||
1000 | 10,000 | 25,000 | 1,087,434 | 1,087,430 | 1,103,931 | ||||
|V|: 1,696,415 | 39,064 | 461,826 | 983,129 | 1,704,954 | 1,857,562 | 2,073,416 | |||
|E|: 11,095,298 | #CCs | 4 | 38 | 73 | #CCs | 204,299 | 200,833 | 181,766 | |
Undirected | 99.6% | 99.7% | 99.6% | 70.1% | 72.7% | 75.9% | |||
Unweighted | #self | 671 | 5306 | 10,746 | #self | 11,936 | 11,955 | 12,325 | |
Cpr. | 86% | 83% | 79% | Cpr. | 65% | 64% | 62% | ||
Graph-summarization methods To generate summaries, we selected one representative per category of graph-summarization approach (see Sect. 2), namely size-driven
We used a
Queries We selected queries for all the classes discussed in Sect. 3: clustering coefficient, representative of numerical queries (answer ); community detection, representative of partitioning queries (answer ); top-ranked centrality and core decomposition, representatives of vertex-set queries (answer ).
Clustering coefficient refers to the average of the individual clustering coefficient of every vertex in the graph. Community detection’s output is a partition of the input vertices into communities, computed with the well-established
GPQPSmethods We evaluate
Assessment criteria We assess accuracy, efficiency, and space requirements of the
Accuracy depends on the specific query. For clustering coefficient, we measure the relative error of the answer obtained via
Regarding centrality, a direct comparison of centrality scores is not really meaningful. Instead, we compare the centrality rank (ties broken randomly) of all the vertices obtained in the original graph and via
As for core decomposition, we assess it similarly to centrality. We take the inner-most core of the graph as a ground-truth vertex set, the top-z inner-most cores computed by
Environment We run all experiments with no parallelization on a single machine equipped with a Dual 20-Core Intel Xeon E5-2698 v4 2.20GHz CPU and 512GB RAM.
Experimental results
Storage space Table 1 shows the compression percentages of the various summaries. Those percentages attest how
Effectiveness and efficiency results are reported in Tables 2, 3, 4. and 5.
Table 2.
Method | Clustering coefficient | Community detection (modularity) | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Relative error | Runtime (s) | Relative error | Runtime (s) | ||||||||||
#supernodes: | 350 | 500 | 750 | 350 | 500 | 750 | 350 | 500 | 750 | 350 | 500 | 750 | |
#superedges: | 8k | 14k | 24k | 8k | 14k | 24k | 8k | 14k | 24k | 8k | 14k | 24k | |
– | 1.69 | – | .86 | ||||||||||
.263 | .21 | .138 | .121 | .261 | .548 | .124 | .101 | .038 | .06 | .11 | .24 | ||
.466 | .367 | .284 | .864 | 1.74 | 3.01 | .116 | .043 | .043 | .06 | .11 | .23 | ||
.034 | .025 | .01 | 1.12 | 3.3 | 7.45 | .069 | .047 | .052 | 1.11 | 1.88 | 4.09 | ||
.034 | .025 | .01 | 5.92 | 18.6 | 42.6 | .058 | .045 | .052 | .97 | 2.02 | 3.81 | ||
.207 | .161 | .128 | 6.26 | 19.4 | 43.4 | .115 | .106 | .065 | .99 | 1.83 | 4.03 | ||
#supernodes: | 500 | 750 | 1k | 500 | 750 | 1k | 500 | 750 | 1k | 500 | 750 | 1k | |
#superedges: | 7k | 10k | 13k | 7k | 10k | 13k | 7k | 10k | 13k | 7k | 10k | 13k | |
– | .215 | – | .61 | ||||||||||
1.67 | 1.31 | 1.13 | .069 | .111 | .141 | .342 | .309 | .262 | .06 | .11 | .13 | ||
.002 | .075 | .048 | .263 | .396 | .452 | .329 | .299 | .268 | .16 | .12 | .29 | ||
1.02 | .857 | .755 | .456 | .769 | 1.08 | .388 | .334 | .321 | .95 | 1.48 | 2.08 | ||
1.02 | .857 | .755 | 1.53 | 2.51 | 3.36 | .37 | .317 | .332 | 1.01 | 1.56 | 1.93 | ||
.766 | .64 | .579 | 1.56 | 2.54 | 3.42 | .353 | .317 | .278 | .92 | 1.58 | 2.12 | ||
#supernodes: | 1k | 1.5k | 2k | 1k | 1.5k | 2k | 1k | 1.5k | 2k | 1k | 1.5k | 2k | |
#superedges: | 51k | 68k | 81k | 51k | 68k | 81k | 51k | 68k | 81k | 51k | 68k | 81k | |
– | 3.97 | – | 6.58 | ||||||||||
.132 | .198 | .244 | 1.64 | 2.09 | 2.25 | .271 | .215 | .218 | .54 | .67 | 1.36 | ||
.666 | .632 | .613 | 6.87 | 8.06 | 8.16 | .23 | .201 | .18 | .51 | .77 | 1.74 | ||
.273 | .308 | .31 | 1.08 | 16.6 | 22.3 | .301 | .325 | .312 | 6.97 | 11.1 | 17.6 | ||
.279 | .309 | .312 | 38 | 59.1 | 73.7 | .318 | .322 | .317 | 6.61 | 10.5 | 20.7 | ||
.37 | .383 | .381 | 38.6 | 6.45 | 75.5 | .238 | .223 | .223 | 7.33 | 13.1 | 21.5 | ||
#supernodes: | 500 | 750 | 1k | 500 | 750 | 1k | 500 | 750 | 1k | 500 | 750 | 1k | |
#superedges: | 2k | 3k | 5k | 2k | 3k | 5k | 2k | 3k | 5k | 2k | 3k | 5k | |
– | .128 | – | 1.13 | ||||||||||
39.9 | 34.8 | 30.1 | .017 | .033 | .055 | .947 | .914 | .865 | .05 | .1 | .15 | ||
1.28 | .677 | .422 | .035 | .064 | .1 | .954 | .910 | .863 | .05 | .11 | .16 | ||
.017 | .19 | .3 | .067 | .108 | .168 | 1.02 | 1.02 | 1.01 | 1.28 | 1.94 | 2.37 | ||
.017 | .19 | .3 | .085 | .142 | .268 | 1.02 | 1.02 | 1 | 1.5 | 2.26 | 2.32 | ||
.003 | .222 | .345 | .085 | .142 | .238 | 1.02 | 1.01 | 1.02 | 1.19 | 1.69 | 2.4 | ||
#supernodes: | 350 | 500 | 750 | 350 | 500 | 750 | 350 | 500 | 750 | 350 | 500 | 750 | |
#superedges: | 35k | 55k | 83k | 35k | 55k | 83k | 35k | 55k | 83k | 35k | 55k | 83k | |
– | 33.5 | – | 1.79 | ||||||||||
1316 | 1208 | 1075 | .999 | 2.28 | 3.5 | .838 | .733 | .656 | .26 | .4 | .86 | ||
1.18 | .526 | .123 | 11.4 | 19.5 | 28.1 | .130 | .093 | .055 | .29 | .49 | .76 | ||
1231 | 1150 | 1058 | 14.9 | 31.2 | 52.6 | .788 | .783 | .66 | 4.76 | 7.69 | 12.9 | ||
1.88 | 1.07 | .509 | 145 | 244 | 371 | .202 | .117 | .078 | 5.31 | 8.37 | 13.4 | ||
1.77 | .982 | .439 | 146 | 246 | 379 | .153 | .114 | .069 | 4.99 | 7.94 | 14.8 | ||
#supernodes: | 1k | 10k | 25k | 1k | 10k | 25k | 1k | 10k | 25k | 1k | 10k | 25k | |
#superedges: | 39k | 462k | 983k | 39k | 462k | 983k | 39k | 462k | 983k | 39k | 462k | 983k | |
– | 1.6k | – | 440 | ||||||||||
1.58 | .8 | .56 | 1.54 | 53.8 | 106 | .544 | .342 | .283 | .381 | 6.32 | 14.9 | ||
.92 | .815 | .754 | 8.03 | 118 | 217 | .518 | .35 | .289 | .452 | 7.09 | 18.7 | ||
.44 | .191 | .155 | .417 | 30.9 | 147 | .632 | .5 | .455 | 45.1 | 92.7 | 204 | ||
.184 | .129 | .117 | 3.95 | 133 | 593 | .734 | .609 | .509 | 47.9 | 91.8 | 216 | ||
.657 | .451 | .402 | 1.3 | 109 | 433 | .666 | .465 | .416 | 96.6 | 104 | 234 | ||
“
Table 3.
Dataset | Method | Clustering coefficient | Community detection (modularity) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Relative error | Runtime (s) | Relative error | Runtime (s) | ||||||||||
#supernodes | 708 | 977 | 1.2k | 708 | 977 | 1.2k | 708 | 977 | 1.2k | 708 | 977 | 1.2k | |
#superedges | 664 | 2.2k | 4.4k | 664 | 2.2k | 4.4k | 664 | 2.2k | 4.4k | 664 | 2.2k | 4.4k | |
– | 1.69 | – | .86 | ||||||||||
.652 | .404 | .228 | .006 | .014 | .036 | .23 | .204 | .21 | .03 | .06 | .08 | ||
#supernodes | 3.4k | 3.6k | 3.8k | 3.4k | 3.6k | 3.8k | 3.4k | 3.6k | 3.8k | 3.4k | 3.6k | 3.8k | |
#superedges | 1.2k | 1.6k | 2k | 1.2k | 1.6k | 2k | 1.2k | 1.6k | 2k | 1.2k | 1.6k | 2k | |
– | .215 | – | .61 | ||||||||||
.991 | .951 | .924 | .012 | .015 | .017 | .703 | .511 | .512 | .19 | .28 | .33 | ||
#supernodes | 23k | 24k | 24.4k | 23k | 24k | 24.4k | 23k | 24k | 24.4k | 23k | 24k | 24.4k | |
#superedges | 10k | 29k | 44k | 10k | 29k | 44k | 10k | 29k | 44k | 10k | 29k | 44k | |
– | 3.97 | – | 6.58 | ||||||||||
.907 | .773 | .568 | .196 | .416 | .773 | .731 | .464 | .412 | 2.39 | 2.15 | 2.74 | ||
#supernodes | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | 1.1M | |
#superedges | 1.7M | 1.9M | 2.1M | 1.7M | 1.9M | 2.1M | 1.7M | 1.9M | 2.1M | 1.7M | 1.9M | 2.1M | |
– | 1.6k | – | 440 | ||||||||||
.044 | .071 | .076 | 60.8 | 76.8 | 78.4 | .679 | .675 | .688 | 197 | 192 | 206 | ||
“
Table 4.
Dataset | Method | PageRank centrality | Closeness centrality | Core decomposition | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
runt. | runt. | runt. | ||||||||||||||||||||
P | R | P | R | P | R | P | R | P | R | P | R | P | R | P | R | P | R | |||||
#supernodes: 500, #superedges: 14,390 | ||||||||||||||||||||||
2.2 | – | 23 | – | .64 | – | |||||||||||||||||
.98 | .26 | .26 | .21 | .43 | .13 | .64 | .43 | .22 | .22 | .15 | .31 | .08 | .41 | .11 | .99 | .05 | .99 | .05 | .93 | .69 | ||
.56 | .36 | .36 | .23 | .45 | .15 | .76 | 6.7 | .18 | .18 | .15 | .31 | .08 | .41 | .13 | 1 | .12 | 1 | .12 | 1 | .12 | ||
40 | .39 | .28 | .24 | .41 | .16 | .65 | 45 | .32 | .28 | .16 | .31 | .08 | .4 | 7.7 | 1 | .03 | 1 | .03 | 1 | .03 | ||
40 | .39 | .28 | .24 | .41 | .16 | .65 | 437 | .32 | .28 | .16 | .31 | .08 | .4 | 9.6 | 1 | .04 | 1 | .06 | 1 | .15 | ||
39 | .36 | .3 | .23 | .42 | .16 | .7 | 454 | 0 | 0 | .18 | .27 | .09 | .37 | 9.6 | 1 | .07 | 1 | .07 | 1 | .08 | ||
#supernodes: 750, #superedges: 10,350 | ||||||||||||||||||||||
1.7 | – | 61 | – | .25 | – | |||||||||||||||||
.31 | .66 | .66 | .42 | .85 | .2 | .98 | .81 | .47 | .49 | .27 | .55 | .19 | .93 | .08 | .79 | .14 | .84 | .2 | .84 | .24 | ||
.36 | .69 | .69 | .42 | .85 | .2 | 1 | 8.7 | .11 | .11 | .09 | .22 | .13 | .68 | .09 | 1 | .04 | 1 | .06 | 1 | .09 | ||
20.3 | .79 | .61 | .51 | .79 | .21 | .97 | 76 | .8 | .6 | .51 | .88 | .2 | .95 | 4 | 1 | .06 | 1 | .08 | 1 | .15 | ||
20.8 | .79 | .61 | .51 | .79 | .21 | .97 | 337 | .8 | .6 | .51 | .88 | .2 | .95 | 5.3 | 1 | .04 | 1 | .09 | 1 | .15 | ||
20.9 | .72 | .65 | .46 | .83 | .21 | .99 | 367 | 1 | .01 | 1 | .03 | .31 | .94 | 5.1 | 1 | .04 | 1 | .07 | 1 | .12 | ||
#supernodes: 1500, #superedges: 68,405 | ||||||||||||||||||||||
24 | – | 1.2k | – | 4.2 | – | |||||||||||||||||
1.7 | .56 | .56 | .35 | .7 | .17 | .87 | 8.8 | .75 | .75 | .46 | .91 | .19 | .98 | .55 | .52 | .62 | .53 | .67 | .48 | .92 | ||
2.1 | .56 | .56 | .35 | .7 | .17 | .86 | 112 | 0 | 0 | .01 | .04 | .02 | .12 | .67 | .77 | .83 | .67 | .9 | .58 | .92 | ||
123 | .6 | .53 | .37 | .68 | .18 | .84 | 610 | .89 | .78 | .54 | .98 | .21 | .99 | 32 | .65 | .65 | .59 | .68 | .59 | .74 | ||
126 | .6 | .53 | .37 | .68 | .18 | .84 | 5.8k | .89 | .78 | .54 | .98 | .21 | .99 | 43 | .64 | .66 | .58 | .69 | .49 | .75 | ||
130 | .59 | .55 | .36 | .69 | .18 | .86 | 6.4k | 0 | 0 | 0 | 0 | 1 | .03 | 42 | .78 | .61 | .69 | .67 | .62 | .75 | ||
#supernodes: 750, #superedges: 3383 | ||||||||||||||||||||||
.76 | – | 42 | – | .2 | – | |||||||||||||||||
.11 | .65 | .65 | .4 | .8 | .17 | .87 | .14 | .62 | .72 | .36 | .73 | .14 | .73 | .03 | .99 | .15 | .99 | .16 | 1 | 1 | ||
.15 | .55 | .55 | .39 | .78 | .18 | .88 | 2.9 | .03 | .05 | .02 | .05 | .02 | .09 | .03 | .91 | .01 | .94 | .01 | .96 | .02 | ||
8.3 | 0 | 0 | 1 | .08 | .85 | .39 | 37 | .96 | .24 | .84 | .41 | .76 | .6 | 1.2 | .82 | 0 | .86 | 0 | .97 | .02 | ||
8.6 | 0 | 0 | 1 | .08 | .85 | .39 | 77 | .96 | .24 | .84 | .41 | .76 | .6 | 1.3 | .82 | 0 | .86 | 0 | .97 | .02 | ||
8.8 | 0 | 0 | 1 | .08 | .84 | .38 | 77 | 1 | .11 | .84 | .38 | .76 | .6 | 1.3 | 1 | 0 | 1 | 0 | .9 | .01 | ||
#supernodes: 500, #superedges: 55276 | ||||||||||||||||||||||
4.1 | – | 471 | – | 1.5 | – | |||||||||||||||||
1.1 | .8 | .8 | .49 | .99 | .2 | 1 | .79 | .51 | .5 | .34 | .68 | .16 | .81 | .41 | .99 | .27 | .99 | .27 | .99 | .28 | ||
1.5 | .96 | .96 | .5 | 1 | .2 | 1 | 28 | .04 | .04 | .06 | .12 | .14 | .68 | .46 | 1 | 0 | 1 | .01 | .97 | .01 | ||
94 | .82 | .75 | .53 | .99 | .21 | 1 | 52 | .54 | .5 | .34 | .63 | .17 | .81 | 32 | .99 | .2 | .99 | .26 | .82 | .88 | ||
125 | .97 | .95 | .51 | 1 | .2 | 1 | 1.9k | .6 | .39 | .35 | .52 | .17 | .81 | 38 | 1 | 0 | 1 | .01 | .97 | .01 | ||
127 | .97 | .95 | .51 | 1 | .2 | 1 | 2k | 0 | 0 | 0 | 0 | .19 | .81 | 38 | 1 | 0 | 1 | .01 | .97 | .01 | ||
#supernodes: 10,000, #superedges: 461,826 | ||||||||||||||||||||||
66 | – | 2.5k | – | 180 | – | |||||||||||||||||
2.17 | .27 | .27 | .21 | .43 | .14 | .71 | 53.4 | .42 | .42 | .37 | .75 | .1 | 1 | 4.6 | .8 | .57 | .8 | .57 | .8 | .58 | ||
20.1 | .36 | .36 | .24 | .49 | .16 | .79 | 266 | .43 | .43 | .38 | .75 | .1 | 1 | 5.9 | 1 | 0 | 1 | 0 | 1 | 0 | ||
383 | .38 | .27 | .3 | .46 | .16 | .69 | 630 | .46 | .41 | .38 | .71 | .1 | .96 | 125 | 1 | .06 | 1 | .06 | 1 | .07 | ||
444 | .38 | .27 | .3 | .46 | .16 | .69 | 4.5k | .46 | .41 | .38 | .71 | .1 | .96 | 139 | .81 | .61 | .81 | .61 | .81 | .62 | ||
391 | .37 | .31 | .27 | .46 | .16 | .72 | 3.7k | .7 | .26 | .64 | .65 | .11 | .92 | 126 | 1 | 0 | 1 | 0 | 1 | 0 | ||
“
Table 5.
Dataset | Method | PageRank centrality | Closeness centrality | Core decomposition | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
runt. | runt. | runt. | ||||||||||||||||||||
P | R | P | R | P | R | P | R | P | R | P | R | P | R | P | R | P | R | |||||
#supernodes: 977, #superedges: 2157 | ||||||||||||||||||||||
2.2 | – | 23 | – | .64 | – | |||||||||||||||||
.12 | .13 | .13 | .07 | .13 | .03 | .13 | .54 | .14 | .14 | .09 | .17 | .03 | .17 | .02 | .97 | .99 | .97 | .99 | .94 | 1 | ||
#supernodes: 3568, #superedges: 3821 | ||||||||||||||||||||||
1.7 | – | 61 | – | .25 | – | |||||||||||||||||
.18 | .01 | .01 | 0 | .01 | 0 | .02 | .17 | 0 | 0 | 0 | 0 | 0 | 0 | .03 | .71 | .05 | .45 | .11 | .38 | 1 | ||
#supernodes: 23,957, #superedges: 28,825 | ||||||||||||||||||||||
24 | – | 1.2k | – | 4.2 | – | |||||||||||||||||
1.9 | .67 | .67 | .34 | .67 | .13 | .67 | 209 | .37 | .37 | .23 | .46 | .12 | .58 | .51 | .02 | .48 | .02 | .48 | .02 | 1 | ||
#supernodes: 1,087,430, #superedges: 1,857,562 | ||||||||||||||||||||||
66 | – | 2.5k | – | 180 | – | |||||||||||||||||
9.9 | .65 | .65 | .39 | .78 | .19 | .95 | 712 | .18 | .18 | .09 | .18 | .08 | .39 | 30 | 1 | .34 | 1 | .34 | .89 | .84 | ||
“
Because of lack of space, we report results with varying the summary size for clustering coefficient and community detection queries, and for the other queries we show results for one summary size only. Results with other summary sizes are in the supplementary material (Anagnostopoulos et al. 2024).
General remarks
As expected, the accuracy and runtime overall increase as the the summary size increases. Regarding accuracy, a few exceptions may arise with the
The
The
Clustering coefficient
(Tables 2 and 3). As for
Community detection
(Tables 2 and 3). Results here are broadly in line with the ones of clustering coefficient. Error values are slightly lower, but still mostly falling into a [0.2, 0.3] range. Again, the unweighted
PageRank centrality
(Tables 4 and 5)). As expected, increasing the number s of the summary-retrieved vertices, leads to decreasing precision and increasing recall. On
Closeness centrality
(Tables 4 and 5). Results here are mostly in accordance with PageRank, with an overall slight decrease in the various precision and recall scores. This is not surprising, because closeness centrality is a harder query for
Core decomposition
(Tables 4 and 5). Trends and observations here are similar to the previous queries. As a key difference, for or 2, the precision is mostly higher than the recall. This suggests that taking up to the two top-innermost cores of the summary does match the original innermost core, but it does not cover it entirely.
Summary of main findings
Promising effectiveness
Overall, the basic
Obstacles for a more effective GPQPS
Main exceptions to satisfactorily effective
Consistent gain in storage space is achieved by any tested
GPQPS method.Drastic speedup by Naïve-GPQPS It depends on the dataset and the query, but, typically, it is consistent (i.e., orders of magnitude).
Speedup by probabilistic-GPQPS is appreciable for large datasets or expensive queries. However, the usefulness of this method is not limited to those cases as (1) it gives storage space saving nevertheless, and (2) speedup in smaller graphs or less expensive queries can still be achieved by parallelizing its computation over the various worlds (straightforward parallelization).
Increasing summary size corresponds to an increase of effectiveness and a decrease of speedup, which perfectly meets common sense. A few exceptions to effectiveness increase are when the summaries are not well-connected.
Naïve-GPQPS vs. Probabilistic-GPQPS: no clear winner
These two methods are mostly comparable to one another, with each one (slightly) outperforming the other depending on the dataset and the query. We conjecture that a major reason why
No clear winner among weighted and unweighted variants of the various
Future directions and conclusion
In this paper, we introduce general-purpose (approximate) query processing on summary graph (
Refine GPQPS methods in general
The
As for vertex-set or partitioning queries, an effective direction could be to refine the strategies for deriving a set of vertices out of a set of supernodes. More concretely, instead of the one investigated in this work, which simply takes the union of the supernodes, one can also exploit the information of superedges to derive a more representative vertex set. Another idea could be to rerun the query on the union of the supernodes. In particular, this could intuitively work well for queries such as core decomposition, that only depend on intra-set edges. Conversely, this could be less effective for queries such as community detection, where inter-set edges matter as well.
Refine probabilistic-GPQPS algorithm
Another refinement could be to perform summary aggregation over the possible worlds, instead of query-answer aggregation. A major advantage of this is that aggregation over the worlds would be no more dependent on the form of query answer.
GPQPS on sparse summaries
As mentioned in Sect. 7 one of the main obstacles for a really effective
Miscellanea
Other interesting directions include deriving approximation guarantees for
Acknowledgements
Francesco Gullo and Lorenzo Severini were supported for this research by Project ECS 0000024 Rome Technopole—CUP B83C22002820006, “PNRR Missione 4 Componente 2 Investimento 1.5”, funded by European Commission—NextGenerationEU. Aris Anagnostopoulos was supported by the ERC Advanced Grant 788893 AMDROMA, the EC H2020RIA project “SoBigData++” (871042), the PNRR MUR project PE0000013-FAIR, the PNRR MUR project IR0000013-SoBigData.it, and the MUR PRIN project 2022EKNE5K “Learning in Markets and Society”. Giorgia Salvatori’s work for this research was carried out while she was an intern at UniCredit.
Author contributions
All the authors contributed equally to all the phases of the work.
Data availability
No datasets were generated or analysed during the current study.
Reproducibility
Data and code are available at https://github.com/fgullo/GPQPS
Declarations
Conflict of interest
The authors declare no competing interests.
Although other notions of “summary graph” do exist in the graph-summarization literature, in this work we consider the one that collapses nodes and edges into supernodes and superedges (Definition 1), which is the most common one.
2Experimental evidence of this claim is reported in Lee et al. (2022), for PageRank, triangle counting, breadth first search, and shortest path queries.
3In this work, we consider solely queries that operate on graphs. Thus, we hereinafter use the terms “graph query” and “query” interchangeably. Moreover, our notion of query corresponds to what is termed “query class” in some existing works (Fan et al. 2012, 2021, 2022).
4slightly abuses notation: .
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abiteboul S, Kanellakis P, Grahne G (1987) On the representation and querying of sets of possible worlds. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 34–48
Aggarwal, CC; Wang, H. Managing and mining graph data, advances in database systems; 2010; Berlin, Springer: [DOI: https://dx.doi.org/10.1007/978-1-4419-6045-0]
Aggarwal, CC; Wang, H. Aggarwal, CC; Wang, H. A survey of clustering algorithms for graph data. Managing and mining graph data, advances in database systems; 2010; Berlin, Springer: pp. 275-301. [DOI: https://dx.doi.org/10.1007/978-1-4419-6045-0_9]
Ahn KJ, Guha S, McGregor A (2012) Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of symposium on principles of database systems (PODS), pp 5–14
Anagnostopoulos A, Arrigoni V, Gullo F et al (2024) General-purpose query processing on summary graphs—supplementary material (https://github.com/fgullo/GPQPS)
Batagelj, V; Zaversnik, M. Fast algorithms for determining (generalized) core groups in social networks. Adv Data Anal Classif (ADAC); 2011; 5,
Beg MA, Ahmad M, Zaman A et al (2018) Scalable approximation algorithm for graph summarization. In: Proceedings of Pacific-Asia conference on advances on knowledge discovery and data mining (PAKDD), pp 502–514
Besta M, Hoefler T (2018) Survey and taxonomy of lossless graph compression and space-efficient graph representations. CoRR arXiv:abs/1806.01799
Besta M, Weber S, Gianinazzi L et al (2019) Slim Graph: practical lossy graph compression for approximate graph processing, storage, and analytics. In: Proceeedings of international conference for high performance computing, networking, storage and analysis (SC), pp 35:1–35:25
Biafore C, Nawab F (2016) Graph summarization for geo-correlated trends detection in social networks. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 2247–2248
Blondel, VD; Guillaume, JL; Lambiotte, R et al. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp; 2008; 10, P10008. [DOI: https://dx.doi.org/10.1088/1742-5468/2008/10/P10008]
Boldi P, Vigna S (2004) The WebGraph framework I: compression techniques. In: Proceedings of world wide web conference (WWW), pp 595–602
Boldi, P; Santini, M; Vigna, S. Permuting web and social graphs. Internet Math; 2009; 6,
Coscia M, Neffke FMH (2017) Network backboning with noisy data. In: Proceedings of IEEE International conference on data engineering (ICDE), pp 425–436
Dalvi N, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of international conference on very large data bases (VLDB), pp 864–875
Fan W, Li J, Wang X et al (2012) Query preserving graph compression. In: Proceedings of ACM international conference on Management of Data (SIGMOD), pp 157–168
Fan W, Li Y, Liu M et al (2021) Making graphs compact by lossless contraction. In: Proceedings of ACM international confernce on management of data (SIGMOD), pp 472–484
Fan W, Li Y, Liu M et al (2022) A hierarchical contraction scheme for querying big graphs. In: Proceedings of ACM international confernce on management of data (SIGMOD), pp 1726–1740
Fazzone, A; Lanciano, T; Denni, R et al. Discovering polarization niches via dense subgraphs with attractors and repulsers. Proc VLDB Endowm (PVLDB); 2022; 15,
Fu X, Yu S, Benson AR (2019) Modelling and analysis of tagging networks in stack exchange communities. J Complex Netw 8(5)
Fung, WS; Hariharan, R; Harvey, NJA et al. A general framework for graph sparsification. SIAM J Comput (SICOMP); 2019; 48,
Galimberti E, Ciaperoni M, Barrat A et al (2021) Span-core decomposition for temporal networks: Algorithms and applications. ACM Trans Knowl Discov Data (TKDD) 15(1):2:1–2:44
Gionis, A; Mannila, H; Tsaparas, P. Clustering aggregation. ACM Trans Knowl Discov Data (TKDD); 2007; 1,
Gou X, Zou L, Zhao C et al (2019) Fast and accurate graph stream summarization. In: Proceedings of IEEE international conference on data engineering (ICDE), pp 1118–1129
Gullo F, Tagarelli A, Greco S (2009) Diversity-based weighting schemes for clustering ensembles. In: Proceedings of SIAM international conference on data mining (SDM), pp 437–448
Hagberg AA, Schult DA, Swart PJ (2008) Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th python in science conference, pp 11–15
Hajiabadi M, Singh J, Srinivasan V et al (2021) Graph summarization with controlled utility loss. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 536–546
Hernández, C; Navarro, G. Compressed representations for web and social graphs. Knowl Inf Syst (KAIS); 2014; 40,
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of ACM symposium on theory of computing (STOC), pp 604–613
Jiang Z, Chen H, Jin H (2023) Auxo: a scalable and efficient graph stream summarization structure. In: Proceedings of the VLDB endowment (PVLDB) 16(6)
Jin, D; Yu, Z; Jiao, P et al. A survey of community detection approaches: from statistical modeling to deep learning. IEEE Trans Knowl Data Eng (TKDE); 2023; 35,
Kang S, Lee K, Shin K (2022a) Are edge weights in summary graphs useful? A comparative study. In: Proceedings of Pacific-Asia conference on advances on knowledge discovery and data mining (PAKDD), pp 54–67
Kang S, Lee K, Shin K (2022b) Personalized graph summarization: formulation, scalable algorithms, and applications. In: Proceedings of IEEE international conference on Data Engineering (ICDE), pp 2319–2332
Ke, X; Khan, A; Bonchi, F et al. Multi-relation graph summarization. ACM Trans Knowl Discov Data (TKDD); 2022; 16,
Khan A, Ye Y, Chen L (2018) On uncertain graphs. Synthesis lectures on data management, Morgan & Claypool Publishers, San Rafael
Khan, K; Nawaz, W; Lee, Y. Set-based approximate approach for lossless graph summarization. Computing; 2015; 97,
Ko J, Kook Y, Shin K (2020) Incremental lossless graph summarization. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 317–327
Koutra D, Kang U, Vreeken J et al (2014) VoG: summarizing and understanding large graphs. In: Proceedings of SIAM international conference on data mining (SDM), pp 91–99
Kumar, KA; Efstathopoulos, P. Utility-driven graph summarization. Proc VLDB Endowm (PVLDB); 2018; 12,
Lanciano T, Savino A, Porcu F et al (2023) Contrast subgraphs allow comparing homogeneous and heterogeneous networks derived from omics data. GigaScience 12
Lee K, Jo H, Ko J et al (2020) SSumM: sparse summarization of massive graphs. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 144–154
Lee K, Ko J, Shin K (2022) SLUGGER: lossless hierarchical summarization of massive graphs. In: Proceedings of IEEE international conference on data engineering (ICDE), pp 472–484
LeFevre K, Terzi E (2010) GraSS: Graph structure summarization. In: Proceedings of SIAM international conference on data mining (SDM), pp 454–465
Leskovec J, Krevl A (2014) SNAP datasets: stanford large network dataset collection. http://snap.stanford.edu/data
Liu X, Tian Y, He Q et al (2014) Distributed graph summarization. In: Proceedings of ACM international conference on information and knowledge management (CIKM), pp 799–808
Liu, Y; Safavi, T; Dighe, A et al. Graph summarization methods and applications: a survey. ACM Comput Surv (CSUR); 1982; 51,
Lloyd, SP. Least squares quantization in PCM. IEEE Trans Inf Theory; 1982; 28,
Maserrat H, Pei J (2010) Neighbor query friendly compression of social networks. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 533–542
Mehmood Y, Bonchi F, García-Soriano D (2016) Spheres of influence for more effective viral marketing. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 711–726
Mosa, MA; Hamouda, A; Marei, M. Graph coloring and ACO based summarization for social networks. Expert Syst Appl; 2017; 74, pp. 115-126. [DOI: https://dx.doi.org/10.1016/j.eswa.2017.01.010]
Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 419–432
Newman, MEJ; Girvan, M. Finding and evaluating community structure in networks. Phys Rev E; 2004; 69,
Riondato M, García-Soriano D, Bonchi F (2014) Graph summarization with quality guarantees. In: Proc. IEEE international conference on data mining (ICDM), pp 947–952
Riondato, M; García-Soriano, D; Bonchi, F. Graph summarization with quality guarantees. Data Min Knowl Discov (DAMI); 2017; 31,
Sadri, A; Salim, FD; Ren, Y et al. Shrink: Distance preserving graph compression. Inf Syst; 2017; 69, pp. 180-193. [DOI: https://dx.doi.org/10.1016/j.is.2017.06.001]
Schaeffer, SE. Graph clustering. Comput Sci Rev; 2007; 1,
Serrano, MÁ; Boguñá, M; Vespignani, A. Extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci; 2009; 106,
Shin K, Ghoting A, Kim M et al (2019) SWeG: Lossless and lossy summarization of web-scale graphs. In: Proceedings of world wide web conference (WWW), pp 1679–1690
Slater, PB. A two-stage algorithm for extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci; 2009; 106,
Spielman, DA; Teng, S. Spectral sparsification of graphs. SIAM J Comput (SICOMP); 2011; 40,
Toivonen H, Zhou F, Hartikainen A et al (2011) Compression of weighted graphs. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 965–973
Topchy AP, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Proc. IEEE international conference on data mining (ICDM), pp 331–338
Tsalouchidou, I; Bonchi, F; Morales, GDF et al. Scalable dynamic graph summarization. IEEE rans Knowl Data Eng (TKDE); 2020; 32,
ur Rehman, S; Nawaz, A; Ali, T et al. g-Sum: agraph summarization approach for a single large social network. EAI Endorsed Trans Scalable Inf Syst; 2021; 8,
Ward, JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc; 1963; 58,
Yong Q, Hajiabadi M, Srinivasan V et al (2021) Efficient graph summarization using weighted LSH at billion-scale. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 2357–2365
Zeng Y, Song C, Ge T (2021) Selective edge shedding in large graphs under resource constraints. In: Proceedings of IEEE international conference on data engineering (ICDE), pp 2057–2062
Zhou F, Mahler S, Toivonen H (2010) Network simplification with minimal loss of connectivity. In: Proc. IEEE international conference on data mining (ICDM), pp 659–668
Zhou, F; Qu, Q; Toivonen, H. Summarisation of weighted networks. J Exp Theor Artif Intell; 2017; 29,
Zhou H, Liu S, Lee K et al (2021) DPGS: degree-preserving graph summarization. In: Proceedings of SIAM international conference on data mining (SDM), pp 280–288
Copyright Springer Nature B.V. Dec 2024