Triangle enumeration is a foundational building block for solving harder graph problems related to social networks, the Internet, and transportation, to name a few applications. This problem is well studied in the theory literature, but remains an open problem with big data. In this paper, we defend the idea of solving triangle enumeration with SQL queries that evaluate the steps of a new adaptive algorithm with linear speedup. Such an SQL approach provides scalability beyond RAM limits, automatic parallel processing, and, more importantly, linear speedup as more machines are added. We present theoretical results and experimental validation showing that our solution works well on large graphs analyzed on a parallel cluster with many machines, producing a balanced workload even with highly skewed vertex degrees. We consider two types of distributed systems: (1) a parallel DBMS that evaluates SQL queries, and (2) a parallel HPC cluster calling the MPI library (via Python). Extensive benchmark experiments with large graphs show our SQL solution offers many advantages over MPI and competing graph analytic systems.
Introduction
Triangle enumeration is a fundamental graph problem used as a building block to solve complex graph problems. Triangle enumeration underlies numerous graph analytic applications including social networks, biochemical interactions, and the Internet. Indeed, people tend to choose friends who share similarities with them [1], and friends of friends tend to become friends themselves [2]. These facts are usually exploited by triangle analytic applications, where triangles are mainly generated by homophily or transitivity [3]. Examples of applications built on triangle detection are: (i) clique detection, for which triangle detection is the first step of the solution, (ii) the clustering coefficient and the transitivity of a graph, crucial metrics based on triangle density that are used, for example, to determine the age of a community, and (iii) graph summarization, which reduces the graph volume by substituting every three vertices forming a triangle with a super-vertex that imitates a vertex in the summarized graph. In addition, triangles are involved in a variety of Internet applications. We can cite the cases of (a) uncovering hidden thematic structures [4] on the Web graph to detect similarities, by using the transitivity property of triangles to extend seeds to larger subsets of vertices with similar thematic structure, (b) spam detection [5], in which the distribution of the triangle count differs between spam and non-spam hosts at a detectable level, allowing easy classification of a host, (c) content quality and role behavior identification of a user on a forum [6], where the number of triangles in which a user participates plays a critical role in identifying the interactions between participants, (d) recommendation systems [7], in which triangle enumeration is widely used to determine the similarities between users and then target their friends based on shared interests, and (e) computer-aided design (CAD) applications [8], where triangle counting is essential in the graph-constructive approach to solving systems of geometric constraints. Other applications can be found in [3, 9].
The above examples show the essential role played by triangle detection in developing modern data science applications, which motivates academia and industry to develop efficient algorithms. Due to the large scale of today's graphs, it has become almost impossible to fit an entire graph into the main memory of a single machine. Therefore, distributed infrastructures and algorithms have been proposed to reach scalability [10–14]. Indeed, when the vertices forming a triangle are hosted on different machines, heavy message passing is performed to gather each triangle's vertices on one machine, and this cost accumulates when computing all triangles in parallel. This may overload the memory and the bandwidth.
The algorithm proposed by [13] offers the optimal worst-case time complexity, which is O(m^{3/2}), where m represents the number of edges of the studied graph. In order to satisfy scalability and minimize message passing, these algorithms use data partitioning by reshuffling the edges. By analyzing these studies from a data partitioning angle, we find that they either benefit from the partitioning strategies offered by the engines processing the graphs or design their own strategies. The work of [14] is an example of the former, since it uses Hadoop as a data processing engine to partition the graph into small chunks, then detects the triangles within each chunk in several MapReduce rounds. [13] is an example of a system that uses its own partitioning strategy: it assigns a color to each vertex of the graph uniformly at random and exploits these colors (given as an input of the algorithm) to conduct the splitting. Triangles are then output on each machine according to the vertex coloring. These two types of partitioning strategies do not guarantee the elimination of the communication cost during the enumeration, except when some partitions are fully replicated, which is not a reasonable solution.
One of the main lessons we learned from our analysis of existing studies on distributed algorithms for triangle enumeration (counting) is the necessity of associating them with an adequate infrastructure that contributes to (i) minimizing message passing, (ii) ensuring scalability, and (iii) easing the deployment of the algorithm by data scientists on either modest or sophisticated infrastructures. The infrastructure's architecture is shared-nothing and based on the computational model of [15], well adapted to big data processing, including graphs.
Keeping these motivations in mind, we present a novel adaptive scalable optimized algorithm with its associated architecture based on balanced data distribution for both triangle enumeration and counting. Our ultimate goal is to provide a generic solution with an efficient data partitioning strategy for the triangle enumeration problem. Our contributions cover architectural, algorithmic, theoretical, and experimental aspects. They are summarized as follows:
The proposal of a randomized parallel algorithm for triangle enumeration and counting,
The theoretical study of its properties from a distributed infrastructure perspective, and its correctness rules,
The optimization of the data partitioning allowing independent local computations to be performed without machine coordination or memory limitation,
The definition of an optimized computational model built on shared-nothing architecture,
The optimization of memory space by eliminating the repetitions before the local triangle enumeration (counting),
An extensive experimental study to evaluate our prototype's efficiency and compare it with state-of-the-art solutions.
Preliminaries
Graph
Let G(V, E) be an undirected, unweighted graph, where V and E represent the set of vertices (nodes) and the set of edges, respectively. We denote n = |V| and m = |E|. Each edge (i, j) ∈ E connects a pair of vertices i and j, and N(i) gives the set of neighbors of a vertex i. The graph G can be represented by a binary and symmetric adjacency matrix E of size n × n, where E[i, j] holds 1 if there is an edge between i and j and zero otherwise. Assuming E is sparse, E is stored as an edge list E(i, j), skipping zero entries. Figure 1(a) depicts an undirected graph, (b) shows its adjacency matrix representation, and (c) illustrates its edge list representation.
[See PDF for image]
Fig. 1
Graph Representations
Triangle properties
Definition 2.1
Given an undirected graph G, a triangle, denoted by ∆(u, v, w), is a triple of vertices u, v, w mutually adjacent in G. It assumes three edges (u, v), (v, w), and (u, w). The set T(G) includes all the triangles of G.
Figure 2b represents the triangles of the graph G described in Fig. 2a. In real-world applications (e.g. social networks and transportation networks), most graphs are sparse; therefore m ≪ n^2. However, enumerating the embedded triangles is computationally hard, with time complexity O(m^{3/2}) in the worst case.
[See PDF for image]
Fig. 2
Triangles of the Graph
Definition 2.2
The triangle is the smallest clique in a graph G and a cycle (a path starting and ending at the same vertex) of length 3.
Thus, triangles are involved in solving two fundamental graph problems: (1) the determination of maximal cliques, and (2) the transitive closure. For a graph G, triangle enumeration produces the list of the unique triangles in T(G), which is more expensive than counting. Counting is easily obtained from the enumeration results; in contrast, counting does not necessarily provide the triangle list [16].
Parallel architecture
Our adaptive algorithm is based on the classic big data computational model explained in [15]. It consists of a set of k machines built on a shared-nothing architecture. Each machine can communicate bidirectionally and directly with the other machines via message passing (no shared memory) while running an instance of the distributed algorithm. To achieve the best performance of our system, the model's machines must have a uniform configuration.
Notation summary
We summarize in Table 1 all the notations used in our paper.
Table 1. Notation Summary
Notation | Description |
|---|---|
(i, j) | An edge with i as a source vertex and j as a destination vertex |
m | Edge set size (graph size) |
n | Vertex set size (graph order) |
T(G) | Triangle set of the input graph G |
∆(u, v, w) | A triangle |
k | The count of machines in the computational model |
c | The count of colors |
Our triangle enumeration and counting algorithm
As mentioned previously, enumerating triangles in a distributed or parallel system is challenging due to unbalanced loads and costly message passing, especially with large skewed graphs. Before presenting our main contribution, we review the classic algorithm for triangle enumeration. Based on that framework, we identify weaknesses from a load-balancing perspective, which can be solved with a novel theoretical adaptive algorithm with linear speedup that had not previously been applied in databases. Then, we propose general SQL queries which evaluate each step of the adaptive algorithm. Moreover, we introduce optimizations which help reduce the required number of machines, and important query optimizations that work well across SQL engines. To round out our contributions, we also present an efficient implementation in Python calling MPI, which is faster, but not scalable to large graphs.
Classic triangle algorithms
According to Cohen [17], listing triangles can be performed in two steps:
Identifying all the directed wedges (pairs of connected edges),
Finding whether there is an edge connecting the endpoints of each wedge.
From a SQL implementation perspective, the classic algorithm can be computed with SQL queries, which are automatically executed in parallel. However, this may leave the edges unbalanced between machines or produce an unbalanced triangle output. Precisely, Algorithm 1 can be translated into SQL by two self-joins of the edge table E(i, j), i.e. E ⋈ E ⋈ E, where each join matches on a vertex column. Notice that counting is trivial and can easily be obtained by applying a count aggregation on the results. The above-mentioned algorithm can be rewritten as follows:
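A minimal sketch of such a query, assuming the edge table E(i, j) stores each undirected edge once with i < j (alias names are illustrative):

```sql
-- Classic algorithm as two self-joins of E(i, j).
-- E1 joined with E2 lists the wedges; E3 closes each wedge into a triangle.
SELECT E1.i AS u, E1.j AS v, E2.j AS w
FROM E E1
JOIN E E2 ON E2.i = E1.j              -- wedge (u, v, w) with u < v < w
JOIN E E3 ON E3.i = E1.i
         AND E3.j = E2.j;             -- closing edge (u, w)
-- Counting: wrap the query with SELECT COUNT(*).
```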
The above SQL query does not specify the partitioning strategy used, because that is the responsibility of the DBMS hosting the studied graph. In a parallel DBMS, the table E is usually partitioned by hashing the source or destination vertex, and the obtained fragments are then allocated to the respective machines. This partitioning scenario may increase the number of edge exchanges when performing each join, since the edges that satisfy the join condition are potentially distributed over different machines. Since we have to perform two joins, the number of edges to be exchanged can be enormous.
Figure 3 shows a possible output of the classic algorithm for the graph in Fig. 1. Only three machines output triangles, in an unbalanced manner.
[See PDF for image]
Fig. 3
Unbalanced Triangle Output using the Classic Algorithm (k = 4)
Adaptive triangle algorithm
We believe that a DBMS is suitable to implement a triangle enumeration algorithm with some adjustments, taking lessons from implementing a parallel version of the classic algorithm. This is due to the different optimizations that a DBMS offers and its usage of a reasonable memory budget. Our analysis shows that the classic algorithm's poor load balancing is mainly due to the data partitioning modes (such as hash, range, and list) provided by the DBMS, which were not originally designed for graph management.
Therefore, we propose in this paper our own intelligent data partitioning strategy. More concretely, it splits the vertices into c subsets and then computes in parallel the triangles within each subset. Note that we will provide not only SQL solutions for our algorithm, but also a Python implementation. We summarize the main steps of the adaptive algorithm as follows.
Vertex set partitioning,
Collecting needed edges on the local machines,
Performing local triangle enumeration (counting) in parallel.
Figure 4 illustrates our algorithm for triangle enumeration. Each partition of vertices is assigned a color from c colors. Note that data communication is required in steps 3 and 4. Otherwise, the processing is local.
[See PDF for image]
Fig. 4
Triangle Enumeration (counting) Process of the Adaptive Algorithm
Optimized number of machines (k)
The model is suitable for triangle computation and can be efficiently used by our algorithm since it is mainly intended for large-scale processing. In this model, the number of machines k should respect Eq. (1), where c represents the number of colors used to partition the vertex set (see Sec. 3.2).
k = c^3      (1)
This allows each machine to handle one triplet of colors, and to output only the triangles corresponding to it. As a result, as the number of colors increases, the algorithm requires an excessive number of machines, which is difficult to provide in most cases. An important optimization is to reduce the number of machines from Eq. (1) to Eq. (2):
k = c^2      (2)
Using a model sized according to Eq. (2), each machine can handle c triplets of colors (the assignment is hard-coded in the algorithm). Hence, we can easily increase the number of colors and require fewer machines than the theoretical computational model. This optimization brings several advantages, including less hardware and software setup, while preserving the balance of the workload. Indeed, the number of edges is relatively balanced between machines when partitioning the vertex set. Recall that this partitioning is uniform, which means that the vertex subsets are of equal size. As a result, each machine handles the same number of edges and then collects its missing edges from proxy machines. Thus, each machine enumerates triangles according to all its corresponding c triplets, in a balanced way. For the rest of the paper, we consider k = c^2.
Adaptive algorithm computed by SQL queries
Our current implementation of the adaptive algorithm is written in SQL, and it can run on any parallel distributed database system with a shared-nothing architecture. Nevertheless, it also has the potential to work with the Python Pandas library, since our joins are local (see Sect. 3.5).
To enumerate and count triangles, we opt for SPJ operations (selection, projection, and join). In particular, the selection (σ) operator is used to apply filters, the projection (π) operator is employed to choose specific columns, while the join (⋈) operator merges multiple records according to matching columns. In database language, these are equivalent to the WHERE condition, the SELECT columns, and the JOIN between tables, respectively. In this section, we explain how to adapt the partitioning strategy to work well on the DBMS, and how to count and enumerate triangles locally.
Data set loading
To read the graph into the model (step 1 of Fig. 4), the relational table E(i, j) is used. For that, we distinguish two options: (1) importing the graph from an external data set, or (2) creating the graph from pre-existing database relations. For the former, we need to use the DBMS query to import data, which usually depends on the DBMS; for the latter, works like [18–20] can be exploited. We present Q1 below to express the data set loading into the DBMS, and we provide Fig. 5, which depicts the loading of the graph of Fig. 1 into the DBMS. The fact that we only take into account one direction of each edge, with the property i < j, allows us to eliminate repetitions at the outset and save memory space, especially for large data sets. Indeed, the triangle problem is applied to undirected graphs, so one direction of each edge suffices to ensure its presence in the table E. Notice that the data set loading defined in the adaptive algorithm is equivalent to the data set loading of the classic algorithm, i.e. both present an unbalanced workload.
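A minimal sketch of Q1, assuming a raw file of vertex pairs and a Vertica-style COPY (the file path and staging table are illustrative):

```sql
-- Stage the raw pairs, then keep each undirected edge once with i < j.
CREATE TABLE E_raw (src INT, dst INT);
COPY E_raw FROM '/data/graph.txt' DELIMITER ' ';
CREATE TABLE E AS
SELECT DISTINCT LEAST(src, dst) AS i, GREATEST(src, dst) AS j
FROM E_raw
WHERE src <> dst;    -- drop self-loops; i < j eliminates repetitions
```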
[See PDF for image]
Fig. 5
Table E after Graph Loading: Unbalanced Edge Distribution (k = 4)
Adaptive vertex partitioning
Recall that our partitioning strategy aims to partition the vertex set V into c subsets, each containing roughly n/c vertices. Then, all the edges between pairs of subsets are randomly sent to proxy machines. Finally, the local machines collect all the edges from the proxies according to their triplets. From a database perspective, the graph is defined as an edge table, so to create a table V_c (see Table 2) that hosts the colored vertex set, a union between column i and column j is executed on the table E. Mainly, the table V_c ensures that each vertex picks independently and uniformly at random one color from a set of c distinct colors, using User Defined Functions (UDF) or existing database functions for a uniform random choice (step 2 of Fig. 4). Given that the vertex coloring function is uniform and independent, this partitions the vertex set into subsets of equal size. Note that the table V_c is in O(n) because it contains all the vertices and holds at most one additional byte per vertex to store its color. Then, a deterministic assignment of color triplets through the table Triplet(machine, color1, color2, color3) (see Table 3) assigns each of the c^3 possible color triplets formed by the c distinct colors to a machine. Note that each machine in Table 3 is mapped to two color triplets, since we are using the optimized model (4 machines). The queries Q2 and Q3 below are for vertex set partitioning and color triplet mapping, respectively.
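Sketches of Q2 and Q3 under the schemas above; the random coloring relies on the DBMS's uniform random function, and c = 2 is hard-coded to match Tables 2 and 3:

```sql
-- Q2 (sketch): V_c assigns each vertex one of c colors uniformly at random.
CREATE TABLE V_c AS
SELECT v.i, FLOOR(RANDOM() * 2)::INT AS color     -- c = 2 colors here
FROM (SELECT i FROM E UNION SELECT j FROM E) v;

-- Q3 (sketch): map the c^3 = 8 ordered color triplets to the k = c^2 = 4
-- machines, c triplets per machine (multi-row INSERT syntax varies by DBMS).
CREATE TABLE Triplet (machine INT, color1 INT, color2 INT, color3 INT);
INSERT INTO Triplet VALUES
  (1, 0, 0, 0), (1, 0, 0, 1),
  (2, 0, 1, 0), (2, 0, 1, 1),
  (3, 1, 0, 0), (3, 1, 0, 1),
  (4, 1, 1, 0), (4, 1, 1, 1);
```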
Table 2. V_c: Black (0), White (1) (c = 2 and k = 4)
i | Color |
|---|---|
1 | 0 |
2 | 1 |
3 | 1 |
4 | 0 |
5 | 1 |
6 | 0 |
7 | 1 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 1 |
Table 3. Triplet: Black (0), White (1) (c = 2 and k = 4)
Machine | Color1 | Color2 | Color3 |
|---|---|---|---|
1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 |
2 | 0 | 1 | 1 |
3 | 1 | 0 | 0 |
3 | 1 | 0 | 1 |
4 | 1 | 1 | 0 |
4 | 1 | 1 | 1 |
Next, each machine designates one random machine from the model as an edge proxy for each edge it holds. Then, it sends all its edges to the respective edge proxies (step 3 of Fig. 4). For that, the table E_proxy is created (see Fig. 6). It holds all the edges of the table E together with the colors picked by their end-vertices. Therefore, it is in O(m), where each record carries two supplementary bytes to store the colors of the edge's end-vertices. The table E_proxy is obtained from a double join between E and V_c on the columns i and j. Building E_proxy is the most important step because all the partitioning depends on it. Indeed, the rebalancing is mainly based on partitioning edges across the proxies in a balanced fashion, which allows local machines to collect colored edges from the proxies in order to perform a local triangle enumeration.
[See PDF for image]
Fig. 6
Table E_proxy: Balanced Workload (c = 2 and k = 4)
The table E_local is then built and partitioned by the machine column (see Fig. 7). Recall that we assign c color triplets to each machine, so each local machine collects its needed edges from the edge proxies according to its triplets of colors (step 4 of Fig. 4). Note that the edges are replicated between the local machines. This is because the local machines collect edges according to the end-vertices' colors. As a result, the edges are not balanced at this stage. On the other hand, the triangles are balanced between machines, where each machine holds |T(G)|/k triangles. In order to allow each physical machine to process edges according to the triplets assigned to it, we exploit the DBMS built-in hash function to distribute the rows. Explicitly, we partition the table E_local by its machine column so that each physical machine holds one value of the machine column, i.e. the physical machine H1 will hold all the edges whose column machine = 1. We translate the creation of the tables E_proxy and E_local into the queries Q4 and Q5, respectively.
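Sketches of Q4 and Q5, assuming E_proxy(i, j, ci, cj) and E_local(machine, i, j, ci, cj) as illustrative schemas; the paper's full Q5 additionally handles an i/j inversion in some cases (see Sec. 4), while this sketch keeps a direct per-position color match:

```sql
-- Q4 (sketch): attach each end-vertex's color to every edge.
CREATE TABLE E_proxy AS
SELECT e.i, e.j, vi.color AS ci, vj.color AS cj
FROM E e
JOIN V_c vi ON vi.i = e.i
JOIN V_c vj ON vj.i = e.j;

-- Q5 (sketch): each machine collects the edges whose color pair appears in
-- one of its triplets (three joins with Triplet, unioned together).
CREATE TABLE E_local AS
SELECT t.machine, p.i, p.j, p.ci, p.cj
FROM E_proxy p JOIN Triplet t ON p.ci = t.color1 AND p.cj = t.color2
UNION
SELECT t.machine, p.i, p.j, p.ci, p.cj
FROM E_proxy p JOIN Triplet t ON p.ci = t.color2 AND p.cj = t.color3
UNION
SELECT t.machine, p.i, p.j, p.ci, p.cj
FROM E_proxy p JOIN Triplet t ON p.ci = t.color1 AND p.cj = t.color3;
```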
[See PDF for image]
Fig. 7
Table E_local: Balanced Workload (c = 2 and k = 4)
Local triangle enumeration
To enumerate triangles locally and in parallel, each machine examines its edges whose endpoints are in two distinct subsets among the three color subsets assigned to it. This happens in two steps (a sketch of the corresponding query Q6 follows the list):
Each machine lists all the wedges whose vertices correspond to one of its triplets,
To output triangles, each machine checks whether there is an edge connecting the end-vertices of each listed wedge.
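A sketch of the local enumeration query (Q6), under the same illustrative schemas; the color predicates tie each edge of the triangle to one ordered triplet, so every triangle is output exactly once:

```sql
-- Q6 (sketch): local triangle enumeration; all joins stay on one machine.
SELECT t.machine, E1.i AS u, E1.j AS v, E2.j AS w
FROM Triplet t
JOIN E_local E1 ON E1.machine = t.machine
               AND E1.ci = t.color1 AND E1.cj = t.color2
JOIN E_local E2 ON E2.machine = t.machine
               AND E2.i = E1.j                    -- wedge (u, v, w)
               AND E2.ci = t.color2 AND E2.cj = t.color3
JOIN E_local E3 ON E3.machine = t.machine
               AND E3.i = E1.i AND E3.j = E2.j    -- closing edge (u, w)
               AND E3.ci = t.color1 AND E3.cj = t.color3;
```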
[See PDF for image]
Fig. 8
Balanced Triangle Output on each Machine (c = 2 and k = 4)
Database algorithm optimizations
In the following, we introduce several optimization techniques that are applicable to all types of DBMSs. These techniques can be used with an on-premise DBMS or a DBMS in the cloud. They accelerate the execution of our queries while preserving the accuracy of our results.
Small table replication: this optimization ensures quick access to and availability of the required data on each machine. Indeed, the presence of small tables in the main memory of each machine dramatically accelerates join execution. Therefore, we replicate the table Triplet on each machine using the statement UNSEGMENTED ALL NODES (this statement may vary from one DBMS to another), in order to run the joins with this table locally.
Partitioning: it aims to split a large data set into several small partitions and distribute them on different machines. This ensures a parallel execution and a considerable speed up of the queries. In addition, the DBMS feature that allows tables to be sorted by join columns on each machine can also be used to improve the performance of joins.
Edge table replication: this optimization creates an exact copy of the table participating in the self-join. Sorting the replica and the original table on their respective predicate columns then accelerates the local join. For instance, the table E_local is replicated in our algorithm when performing the local triangle enumeration. This allows the table to be sorted by the column j and its replica by the column i. Thus, the local join is accelerated, because the DBMS can rapidly match the two columns (E1.j = E2.i).
Duplicate triangle elimination: in our algorithm, we only consider triangles in lexicographical order (from the lowest vertex ID to the highest vertex ID). In addition, we read the graph in one direction (i < j), which enforces a lexicographical order between vertices before the triangle enumeration and counting.
Compression: the columnar DBMS uses compression to reduce the required total storage space, so we opt for RLE encoding, where a contiguous run of identical values is converted to a (value, count) pair. Note that row DBMSs may offer other value compression techniques suited to their structures. For example, PostgreSQL provides several compression methods, including ZSON, a JSON compression extension, and the Cstore extension's compression.
Projection: projections are optimized collections of table columns that provide physical storage for data in a columnar DBMS. They are similar to indexes in a row DBMS. We created projections for the tables E, V_c, and E_proxy, and partitioned them by Hash(i). Then, we created a projection for the table E_local and partitioned it by Hash(machine) (sketches of these statements follow this list).
Partitioning vs Segmentation: in columnar DBMS terminology, partitioning means dividing rows locally on each machine according to one or more columns, whereas segmentation refers to splitting a projection across the cluster's machines. In the rest of the paper, we use these two concepts as defined by the columnar DBMS.
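Sketches of the replication and projection optimizations above, in Vertica-style syntax (projection names are illustrative):

```sql
-- Small table replication: one full copy of Triplet per machine.
CREATE PROJECTION Triplet_rep AS
SELECT * FROM Triplet UNSEGMENTED ALL NODES;

-- Sorted, RLE-encoded, hash-segmented projection on the edge table.
CREATE PROJECTION E_p (i ENCODING RLE, j) AS
SELECT i, j FROM E ORDER BY i, j SEGMENTED BY HASH(i) ALL NODES;
```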
Adaptive algorithm programmed in python and MPI
Python is a popular language for graph analytics. Pandas is a fast, powerful, flexible, and easy-to-use Python library for data manipulation. It defines a powerful data structure, the DataFrame. We chose this library because we wanted to present a solution that can be easily integrated into the data analysis pipeline, especially since Pandas is widely used by data scientists. In addition, we use MPI to handle the data communication between machines, since we are employing the model. To formulate our adaptive algorithm in Python, we used the same formulation as for the DBMS, where the selection σ is equivalent to pandas.filter(), the projection π is equivalent to E[[columns]] where E is a Pandas dataframe, and the join ⋈ corresponds to pandas.merge(). Moreover, the input graph data set is stored in the same format as in the DBMS. We read the graph as an edge list into the dataframe E. In an analogous manner to the DBMS version, we store one direction of each edge (i, j), where i < j.
Adaptive Vertex Partitioning
This step aims to eliminate data exchange during the triangle enumeration (counting). Hence, each vertex and its incident edges need to be on the same machine. As explained in Sect. 3.4, the first step is vertex coloring, so the dataframe V_c is created based on the edge list E, where columns i and j are unioned together and random() is used to pick a unique color for each vertex in a uniform and independent manner. Then the edges are sent to proxies using the dataframe E_proxy, which is the result of merging E and V_c. Note that these dataframes are partitioned over the cluster using MPI collective operations. The last step in this phase is to collect the edges from the proxies; for that, we add a small dataframe Triplet, which associates each machine with its corresponding triplets of colors. The dataframe E_local is then created. It allows each machine to have the edges whose end-vertices are in two of the three color subsets assigned to it. This dataframe is partitioned in such a manner that each machine hosts a subgraph of vertices with their incident edges, in order to allow local processing in the next phase.
Local triangle enumeration
In this last phase, two merges are performed between E_local and itself to enumerate all the triangles in the graph. Since each machine has the edges it requires to locally enumerate the triangles, there is no data communication between the machines.
Comparing SQL and python/MPI solutions
Compared with Python/MPI, the SQL language is more elegant and abstract, and easier to understand with basic SQL knowledge; it mainly uses SPJA (Selection, Projection, Join, and Aggregation) operations. Furthermore, all relational operators in SQL naturally work in parallel. Hence, there is no need to explicitly send/receive messages with buffers or to replicate files in SQL implementations. Thus, a SQL implementation can work on both one node and a parallel cluster without any reprogramming, while adapting a serial solution to a parallel solution with MPI is complicated and time-consuming. It is true that the SQL language has some limitations: (1) it requires loading files, (2) every input and intermediate result must be stored in tables, and (3) it is hard to express complicated logic due to little control over data structures in RAM. However, for the triangle problem, all of those are acceptable. Both the input graph and the output triangles can be stored in table format, and our algorithm can be expressed in SQL with the help of the partitioning strategies provided by database systems. What is more, SQL solutions have no memory limitations, while the Python/MPI implementation fails when graphs are too large to fit in main memory. All these statements are experimentally proven in Sec. 5.
Algorithm correctness and complexity analysis
In this section, we investigate the correctness of our results, the load balancing, and the time complexity of our adaptive algorithm.
Correctness of our results
Lemma 4.1
All the triangles are output once.
Proof
To prove that all triangles are output, we define the following properties:
Property 4.2
The graph is read in such a way that only one direction of each edge is imported, which enforces i < j.
Property 4.3
The table Triplet assigns each color triplet to one machine M.
Property 4.4
Each vertex v of V picks uniformly and independently at random one color from the c colors, so each edge (i, j) will be colored with (color(i), color(j)).
Property 4.5
Each edge of E_proxy is sent to the machines whose triplets contain its end-vertices' colors.
According to Property 4.2, when the graph is imported, one direction of each edge is inserted into E, hence all the edges are present in E. Thus, the table V_c holds all the graph vertices, since it is derived from E. These vertices will be colored in a uniform manner according to Property 4.4. Then, each edge e will appear with two color subsets in the table E_proxy, hence it will be sent to the machine defining these two color subsets according to Property 4.5. When we create the table E_local, we make sure that each machine M in the model collects its required edges from the proxies according to its triplets (color1, color2, color3). Hence, if an edge is colored with (x, y), then this edge is sent to M if
(x, y) ∈ {(color1, color2), (color2, color3), (color1, color3)}      (3)
Notice that each edge defines i < j; however, in our triangle enumeration query, we require that the end-vertex colors follow the triplet order when performing E1 ⋈ E2 ⋈ E3 (E1, E2, and E3 are aliases of E_local). Thus, we invert the columns i and j in the table E_local when inserting edges whose colors match a triplet only in the reverse order (refer to query Q5). Hence, all the required edges for triangle enumeration are sent to the respective machines. Recall that the parallel DBMS defines a built-in hash function that allows segmenting data across the cluster's machines. Hence, the table E_local is segmented by the column machine. Then, each physical machine holds all the edges corresponding to its triplets and performs the triangle enumeration task locally. In other words, each machine enumerates locally and in parallel all the triangles whose vertices respect the order and the colors of one of its triplets. In addition, each triangle is output once, according to the following property:
Property 4.6
If ∆(u, v, w) is output as a triangle, then u < v < w (lexicographical order).
Recall that a triangle has six versions in different orders and directions. Only one version should be output: the one that defines a lexicographical order between the vertices. In our query Q6, we applied a filter on the edges when performing the self-joins. Indeed, we required that the first and the second edges forming the triangle satisfy E1.j = E2.i, while the third edge satisfies E3.i = E1.i and E3.j = E2.j. As a result, we preserve the order between vertices and we output one version of each triangle. Notice that each machine of the optimized model manages c triplets of colors. However, each triangle is output once because each triplet of edges forming a triangle must: (i) have end-vertices corresponding to one of the machine's color triplets, and (ii) define a lexicographical order between the triangle's vertices.
Lemma 4.7
There are no missing triangles in the output T(G).
Proof
Recall that Property 4.2 ensures that all the edges are present in the table E. Hence, we are confident that all the graph's triangles can be computed from the table E. In addition, each vertex v in the table V_c chooses uniformly a color from the c colors, according to Property 4.4. As a result, the end-vertices of each edge in E will be colored and sent to the proxy machines via the creation of the table E_proxy. Note that all the edges of E are present in E_proxy; once again, all the graph's triangles are in E_proxy. Then, all the edges in E_proxy are sent to the local machines by creating the table E_local. Notice that all the possible permutations of colors are present in one or more triplets. Hence, each edge of E_proxy will be sent to one or more machines. Therefore, all triangles are also present in E_local. Finally, if the edges (u, v), (v, w), and (u, w) of the table E_proxy have the colors (x, y), (y, z), and (x, z) respectively, they will automatically be sent to the machine M mapped to the color triplet (x, y, z) when creating the table E_local. If u < v < w and these edges form a triangle, the latter will be output on M according to Lemma 4.1, and there will be no missing triangles in our algorithm.
The above proof is also valid for our solution programmed in Python and MPI, since it is equivalent to the SQL query solution. Indeed, when we import the graph, all the edges are present in the dataframe E, since we import one direction of each edge. The dataframe E is scattered over the k machines, then the vertex coloring vector V_c is created and communicated between the machines, hence each machine knows the colors of all vertices. After that, the edges between colored vertices are sent to the proxy machines by scattering the dataframe E_proxy. Notice that all the edges of E are in E_proxy, so there are no missing triangles at this stage. Sending edges to proxies allows the local machines to collect edges into the dataframe E_local according to their color triplets. This is performed using a scatter operation: each machine knows the color triplets of the other machines, so it sends all the edges that are in two subsets of each color triplet to the respective machine, and receives its own edges from the other machines. As a result, all the edges are in E_local and there is no missing triangle. Each machine then has all the edges needed to output, locally and in parallel, all the triangles that respect the color order and the lexicographical order. This ensures that all the triangles are output once and that there is no missing triangle, since all the triangle end-vertices are colored and correspond to one of the color triplets.
Load balancing of our algorithm
We mentioned previously that each vertex of the set V picks uniformly and independently a color from c distinct colors. This gives rise to a partition of the vertex set V into c subsets of roughly n/c vertices each. Each machine then receives a subgraph of G. As mentioned in Sec. 3.2, the number of edges among the sub-graphs is relatively balanced with high probability. Without loss of generality, let us suppose we have two colors (c = 2) and let them be black (b) and white (w). Each vertex of the table V_c chooses uniformly between the two colors, so we will have n/2 white vertices and n/2 black vertices in the table V_c. Each end-vertex of an edge has probability 1/2 of being (b) or (w), so each edge color combination has probability 1/4: either (b, b), (w, w), (w, b) or (b, w). Thus, the table E_proxy holds about m/4 edges of each color combination, and it is segmented across the cluster's machines based on these color combinations. In fact, each of the k machines processes O(m/k) edges, which leads to a balanced edge distribution in the table E_proxy. Moreover, all the model's machines output the same number of triangles. Each local machine collects edges from the proxies according to its triplets of colors, which implies a balanced load in terms of triangles; that is, each machine outputs the same count of triangles. Indeed, since each end-vertex color is uniform, each colored triangle has probability 1/c^3 = 1/8, which means that each color triplet accounts for |T(G)|/8 triangles. For example, the triangles whose vertex colors are (b, b, b) are hosted on machine1, which owns that color triplet, and represent 1/8 of the total triangles (|T(G)|/8), because each vertex has probability 1/2 of choosing (b). Recall that the vertex set partitioning is uniform and independent, hence all the vertices have an equal chance of choosing between (b) and (w). Precisely, the table E_local holds, for each local machine, all the edges whose end-vertices are in two of the three subsets of each of its color triplets. Each triangle requires three vertices, so it has probability 1/c^3 of appearing under one color triplet combination. However, each machine manages c color triplets (see Sec. 3.3), so it outputs c · |T(G)|/c^3 = |T(G)|/c^2 triangles, where the numerator c refers to the number of triplets managed by the local machine, and the denominator c^3 reflects the probability of each colored triangle. For instance, with the two colors (b) and (w), the table E_local holds, per machine, the edges matching that machine's two color triplets.
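The expected per-machine load can be summarized as follows (a worked restatement of the argument above, assuming uniform, independent coloring):

```latex
\Pr[\text{a triangle has color triplet } (x, y, z)] = \frac{1}{c^{3}}, \qquad
\mathbb{E}[\text{triangles per machine}] = c \cdot \frac{|T(G)|}{c^{3}}
  = \frac{|T(G)|}{c^{2}} = \frac{|T(G)|}{k}.
```

So with c = 2 (k = 4), each machine outputs |T(G)|/4 triangles in expectation.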
Time complexity of our algorithm
The running time taken by each machine of the model is proportional to the number of edges and triangles it handles. Recall that there are four main steps in our algorithm; the most time-consuming ones are the triangle enumeration (counting) and the edge collection by the local machines from the proxies. From a database perspective, they are costly because multiple join operations are performed. Thus, we mainly analyze the time complexity of the join operations in these steps. In addition, we analyze the time complexity of the adaptive algorithm programmed with Python and MPI.
Adaptive algorithm with SQL queries
The first step of our adaptive algorithm with SQL queries is vertex coloring, in which we uniformly assign a color to each vertex to create the set V_c. To get all the vertices, we select both the distinct source vertices i and the distinct destination vertices j from the edge list E and union them together. Thus, the edge list is read twice. Note that we need to perform a sort when selecting these distinct values. The edge list is segmented across the cluster by HASH(i), and at the same time the partitions are sorted by i on each machine. Consequently, we only need to scan the edge list partition on each machine when we select the distinct values of i, with a time complexity of O(m). However, the edge list needs to be resegmented and sorted when selecting the distinct values of j, so the time complexity is O(m log m). Next, edges are sent to proxies by creating the colored edge set (table E_proxy in SQL). This includes two joins between the edge list E and the colored vertex set V_c. The base step is E ⋈ V_c, then the derived result is joined with V_c one more time. So, the set V_c is read twice, and the edge list is read once in this step. Note that the join algorithm can be either a hash join or a merge join in SQL (in our parallel DBMS). The time complexity of the join operator is O(m log m) for a sort-merge join if only one of the two join tables is sorted, and it becomes O(m) if both tables are sorted. In our algorithm, the merge join is used when the joined tables are sorted on the join columns. Otherwise, the hash join is used, which is the case for the self-joins. In that case, the optimizer creates a hash table and loads it into main memory, so that it can search for matches between the outer table and the hash table. The worst-case time complexity of the hash join is O(m^2), especially for highly skewed data (cliques), while the estimated average time complexity is O(m), assuming a uniform distribution and low-degree vertices. In this step, both E and V_c are segmented by i across the cluster and sorted by i on each machine. Thus, the first join is implemented as a merge join with time complexity O(m), because the join tables are sorted on the join column. However, for the second join condition, which matches on the destination vertex j, a resegmentation is needed, hence the hash join is used with a time complexity of O(m) on average and O(m^2) in the worst case. Then, we collect edges from the edge proxies by creating E_local. For that, we join E_proxy with Triplet three times and union the results together. Consequently, E_proxy will be read three times. Those three join operations are all hash joins, because the join tables are not sorted on the join columns. Recall that we replicate Triplet over all the cluster's computing machines; it stays in main memory during the join process. Each time a record of E_proxy is retrieved, it is joined with the whole Triplet table. Consequently, all records in E_proxy are retrieved once in each join, and Triplet stays in main memory and is repeatedly used. The time complexity of the hash join for this step is O(m), assuming the size of E_proxy is O(m). Finally, E_local is joined with itself twice in the triangle enumeration (counting) step, hence it is read twice from disk. In the first join, E_local is joined with E_local on E1.j = E2.i and machine. Recall that we create E_local in such a way that each local machine has the required edges to process triangle enumeration (counting) locally. Thus, no resegmentation is needed for the first join operation. In the next step, the temporary derived table is joined with E_local once more on E3.i = E1.i, E3.j = E2.j, and machine. Similarly, no resegmentation is needed for the second join operation.
Those two join operations are both hash joins. Thus, our partitioning strategy eliminates the communication between machines during the join operations. Since we partition E_local across the cluster and no edge needs to be communicated between machines, the local join can be perfectly parallelized. Assuming the size of E_local is O(m), the time complexity of the join operation ranges from O(m/k) for a tree graph to O(m^2/k) for a highly skewed graph, where k is the number of machines in the cluster.
In summary, E will be read four times, E_proxy three times, and E_local twice during the entire process on the DBMS. The size of all those tables is O(m). Thus, the I/O cost of the adaptive algorithm with SQL queries is O(9m), which is O(m).
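In LaTeX form, this I/O tally (with each table of size O(m)) is:

```latex
\underbrace{4\,O(m)}_{E} \;+\; \underbrace{3\,O(m)}_{E_{proxy}}
  \;+\; \underbrace{2\,O(m)}_{E_{local}} \;=\; O(9m) \;=\; O(m).
```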
Adaptive algorithm with python and MPI
In this solution, the edges are read into the dataframe E from the input graph data set in O(m). Then the dataframe V_c is created by concatenating both columns (i, j) of E in order to extract all the vertices; the cost of Pandas' concatenation function is O(m), and removing the repetitions is also executed in O(m). The dataframe V_c is broadcast to all machines in order to compute the dataframe E_proxy in O(m/k). This dataframe is computed locally using two merge operations, on column i and then on column j. The cost of a merge operation in Pandas is O(m), but since the processing is local, it costs O(m/k). E_proxy is then scattered in a random manner to all machines, in order to balance the edge distribution in E_local. Notice that the cost of the scatter operation in MPI is O(log(k)) for each message, but since we scatter O(m/k) edges per machine in parallel, the operation costs O((m/k) log(k)) in total. Then, the dataframe E_local is created and processed locally with a concatenation of three merges between the dataframe E_proxy and the dataframe Triplet. As mentioned previously, the concatenation costs O(m/k) and each merge costs O(m/k) due to the local processing. Therefore, each machine processes E_local in O(m/k) in the worst case, taking into account that the dataframe Triplet is broadcast to all the machines beforehand in O(log(k)). Finally, the triangle enumeration is performed locally and in parallel on the dataframe E_local using a double merge. Before the enumeration, we apply a filter on E_local in which we take all the edges that satisfy E1.j = E2.i for the first self-join and the edges where E3.i = E1.i and E3.j = E2.j for the last self-join. With a uniform graph distribution, the time complexity of the triangle enumeration would be O(m^{3/2}); however, the processing is distributed across the model, so the triangle enumeration time complexity is O(m^{3/2}/k). Note that the communication cost is light compared to the triangle enumeration cost; its time complexity mainly depends on the bandwidth and the number of processors in the model.
Experimental evaluation
Our experiments focus on two major objectives:
Evaluating the load balancing of our adaptive algorithms to prove the efficiency of the partitioning strategy,
Comparing our approach with competing graph systems, to prove that our adaptive algorithm is efficient besides being short, elegant, and abstract.
Hardware and software configuration
Hardware: our experiments were conducted on a cluster of 16 machines with a total of 64 cores. Each machine has a 4-core CPU, its own main memory and disk storage, L1 and L2 caches, and runs Linux Ubuntu Server 18.04 as the operating system. The machines are connected through 1Gb network cards with 128MB/s of bandwidth.
Software: we chose Vertica, a columnar DBMS, to execute our adaptive algorithm with SQL, since it is about 10× faster than row DBMSs for graph problems. However, any other parallel DBMS that provides partitioning control can be used. For the HPC cluster implementation, we used Pandas, a Python data analysis library, to execute the algorithm on the local machines, and the mpi4py library, based on MPI, to manage the data communication between machines. These tools are for the current version of our algorithm, and they will be extended to other systems like Spark SQL and TigerGraph in the future. We compared our results with Spark GraphX, Spark's API for graph analysis, and G-thinker, a distributed graph mining system.
Data sets
Table 4 summarizes the data sets used in our experiments. We exploit seven real data sets from two different sources [21, 22]. For each data set, we exhibit its order (n), its size (m), the expected triangle count (|T(G)|), the maximum degree (d_max), and its size on disk.
Table 4. Data sets
Data sets | n | m | T(G) | d_max | Size |
|---|---|---|---|---|---|
Hyves | 1402k | 2777k | 752,401 | 31,883 | 40MB |
Youtube | 2987k | 1134k | 3,056,386 | 28,754 | 38MB |
as-Skitter | 1696k | 11,095k | 28,769,868 | 35,455 | 149MB |
Flickr-links | 1715k | 15,551k | 548,174,465 | 27,224 | 194MB |
Live-journal | 3997k | 69,362k | 177,820,130 | 14,815 | 501MB |
DBpedia | 18,268k | 172,183k | 328,743,411 | 632,558 | 2.71GB |
Dimacs | 18,483k | 261,787k | 4,451,687,605 | 194,955 | 4.32GB |
Evaluation of adaptive algorithm with SQL queries
Here, we present the evaluation of the query speedup, the memory usage, and the load balancing of the adaptive algorithm with SQL queries.
Adaptive algorithm speedup
To study the speedup of our algorithm computed with SQL queries, we provide Table 5, in which we compare the triangle enumeration (counting) running time by varying the size of the optimized cluster from 1 to 16 machines (k = c^2 with c = 2, 3, and 4). Note that executing the adaptive algorithm on one machine skips the partitioning strategy; it directly runs the triangle enumeration query.
Table 5. Execution Time (in seconds) and Speedup of the Adaptive Algorithm with SQL Queries (c = 2, 3, and 4 resp.)
Data sets | 1 Machine | 4 Machines | 9 Machines | 16 Machines | 4→9 | 4→16 |
|---|---|---|---|---|---|---|
Hyves | 12 | 22 | 16 | 12 | 1.7 | 1.8 |
Youtube | 9 | 16 | 13 | 9 | 1.2 | 1.7 |
as-Skitter | 57 | 55 | 47 | 30 | 1.5 | 1.8 |
Flickr-links | 531 | 224 | 160 | 81 | 1.4 | 2.7 |
Live-journal | 818 | 588 | 187 | 89 | 3.1 | 6.6 |
DBpedia | Crash | 7689 | 4026 | 3760 | 1.9 | 2 |
Dimacs10_uk | 5738 | 5229 | 3962 | 3102 | 1.3 | 1.6 |
4→9: The speedup between 4 machines and 9 machines; 4→16: The speedup between 4 machines and 16 machines
As the number of colors c increases, the running time of our algorithm decreases because there are more machines to process the triangles. Recall that our algorithm preserves a balanced workload with O(m/k), thus the machines in larger models process less data, locally and in parallel. We can see that the speedup using SQL is considerable with large data sets with skewed vertices, such as DBpedia and as-Skitter, or with large sparse graphs such as Live-journal, where it doubles between the models. Also, note that our algorithm does not perform the partitioning on 1 machine, which is why the running time on small data sets is better there. However, when the graph becomes larger, the running time on a larger cluster remains more efficient.
In addition, we provide Table 6 to analyze the memory consumption of our queries. It reports the memory peaks for each query execution on different cluster sizes (c = 2, 3, and 4). Our focus is on the E_proxy and E_local insertion queries, besides the triangle enumeration query. These queries are expensive because they involve many joins. The graph import query and the V_c insertion query behave almost the same across all the data sets regardless of their size.
Table 6. Memory Peaks (in GB) of the Adaptive Algorithm with SQL Queries on the E_proxy, E_local, and Triangle Enumeration (TE) Queries (c = 2, 3, and 4 resp.)
Data sets | 4 machines | | | 9 machines | | | 16 machines | | |
|---|---|---|---|---|---|---|---|---|---|
 | E_proxy | E_local | TE | E_proxy | E_local | TE | E_proxy | E_local | TE |
Hyves | 0.69 | 0.69 | 0.88 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 |
Youtube | 0.69 | 0.69 | 0.88 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 |
as-Skitter | 0.74 | 0.85 | 2.2 | 0.69 | 0.69 | 1.2 | 0.69 | 0.69 | 0.69 |
Flickr-links | 0.74 | 0.85 | 2.2 | 0.69 | 0.69 | 2.2 | 0.69 | 0.69 | 1.2 |
Live-journal | 1.1 | 1.6 | 1.4 | 0.69 | 0.69 | 2.4 | 0.69 | 0.69 | 2.4 |
DBpedia | 1 | 2.6 | 2.4 | 0.9 | 1 | 2.4 | 0.89 | 1 | 1.4 |
Dimacs10_uk | 1 | 2.6 | 1.4 | 0.92 | 2.4 | 2.4 | 0.89 | 1.4 | 2.4 |
By analyzing Table 6, we can see that for small data sets like Youtube and Hyves, the memory peaks are the same across the three models. This is due to the fact that the DBMS allocates memory according to the required data management without exceeding an allocation threshold. Vertica, for example, allocates chunks of memory in powers of 2 to process the data. These chunks become larger as the processed data grows. However, the allocation process only grants the available memory, and the transaction manager executes the transactions without exceeding the allocated memory. That is why, for medium to large data sets, the memory peaks decrease from the smallest model to the largest, since more memory is available on the latter.
Balanced load
The main contribution of the adaptive algorithm is to preserve a balanced workload when computing the triangles. We provide in Figs. 9, 10, and 11 histogram charts of the count of triangles output on each machine on clusters of 4, 9, and 16 machines (c = 2, 3, and 4).
[See PDF for image]
Fig. 9
Triangle Count per Machine (c = 2 and k = 4): Balanced Load
[See PDF for image]
Fig. 10
Triangle Count per Machine (c = 3 and k = 9): Balanced Load
[See PDF for image]
Fig. 11
Triangle Count per Machine (c = 4 and k = 16): Balanced Load
In each model, each machine outputs the same number (|T(G)|/k) of triangles as the other machines for each graph data set. This is due to the data re-balancing performed before the local triangle enumeration (counting). As mentioned previously, the number of triangles among the sub-graphs on the machines is relatively balanced with high probability. Hence each machine processes essentially the same number of edges, which balances the workload.
Efficiency on a parallel cluster
In this section, a comparison of our adaptive algorithm with SQL queries against the classic algorithm, Spark GraphX and G-thinker is conducted.
Adaptive algorithm comparison in database environment
Comparing our adaptive algorithm with the classic algorithm on a columnar DBMS aims to prove the optimization impact of our re-balancing approach on the triangle enumeration (counting) running time. Table 7 summarizes the running time in seconds and the triangle count using both algorithms.
Table 7. Comparison between the Adaptive Algorithm and the Classic Algorithm: Execution Time (in seconds) and Triangle Count (c = 2, 3, and 4 resp.)
Data sets | Classic algorithm with SQL queries | | | | Adaptive algorithm with SQL queries | | | |
|---|---|---|---|---|---|---|---|---|
 | Running time | | | Triangle count | Running time | | | Triangle count |
 | 4 M | 9 M | 16 M | | 4 M | 9 M | 16 M | |
Hyves | 124 | 103 | 52 | 752,401 | 22 | 16 | 12 | 752,401 |
Youtube | 52 | 23 | 14 | 3,056,386 | 16 | 13 | 9 | 3,056,386 |
as-Skitter | 170 | 114 | 77 | 28,769,868 | 55 | 47 | 30 | 28,769,868 |
Flickr-links | 1649 | 592 | 336 | 548,174,465 | 224 | 160 | 81 | 548,174,465 |
LiveJournal | 474 | 236 | 168 | 177,820,130 | 588 | 187 | 89 | 177,820,130 |
DBpedia | 12646 | 5377 | 3930 | 5,774,698,667 | 7689 | 4026 | 3760 | 328,743,411 |
Dimacs10_uk | 5323 | 4646 | 3318 | 4,451,687,605 | 5229 | 3962 | 3102 | 4,451,687,605 |
4 M: 4 Machines; 9 M: 9 Machines; 16 M: 16 Machines
From Table 7, we can see that our adaptive algorithm achieves a better running time, especially for graphs with skewed vertices like Youtube and Hyves. For these graphs, the classic algorithm needs to exchange many edges between machines in order to enumerate all triangles. For graphs like as-Skitter and Flickr-links, which are large graphs with high skewness and many cliques, the edge reshuffling in the adaptive algorithm considerably reduces the triangle enumeration running time. On the other hand, the classic algorithm is extremely expensive, since it incurs excessive communication between machines (due to a large number of highly skewed vertices). In summary, the adaptive algorithm is suitable for highly skewed graphs, regardless of their size.
Fig. 12 depicts a comparison between the adaptive algorithm and the classic algorithm in terms of the triangle output on each machine. As we can notice, all the lines in the adaptive algorithm charts are almost aligned, which indicates that the load balancing is preserved while computing the triangles, especially for very skewed graphs like Hyves, Flickr-links, and DBpedia. In contrast, the classic algorithm unbalances the workload, which explains its unbalanced output and its high runtime. Moreover, note that for DBpedia, which is a data set with multiple edges between vertices, the classic algorithm generates many repetitions, which explains its running time; the adaptive algorithm, on the other hand, considers a single edge (i, j) with i < j between two vertices. This processing optimizes memory usage and reduces the total running time.
[See PDF for image]
Fig. 12
Load Balancing between the Adaptive Algorithm and the Classic Algorithm on 16 Machines (c = 4 and k = 16)
Adaptive algorithm with SQL queries against Spark GraphX and G-thinker comparison
In this section, we compare our adaptive algorithm executed on the DBMS with G-thinker and Spark GraphX. G-thinker is a task-based execution engine for graph processing. It includes many graph algorithms such as subgraph mining and triangle counting. G-thinker benefits from the efficiency of Hadoop as a big data processing system and MPI as a parallel task execution library. This system is a good reference to compare with our solution: it is a Pregel-like system that builds on MPI and Hadoop. Spark GraphX, on the other hand, is a Spark component for graph-parallel computation. It is a vertex-centric system that provides many built-in functions, including triangle counting. Our choice of G-thinker and Spark GraphX is justified by the fact that these two systems are robust and similar to a DBMS in terms of capabilities.
Table 8 summarizes the comparison between the three systems on 4 machines. We chose this model because: (1) we want to provide a system working even with limited memory and processing resources, (2) G-thinker is a task-based engine and can work perfectly on a small budget of memory, and (3) Spark GraphX is a powerful parallel graph processing engine. We believe that the three systems can give perfect results on larger clusters and with sophisticated hardware configurations. However, we want to position our solution, among others, on a cluster with CPU and memory limitations.
Before executing G-thinker algorithm for triangle counting, we convert all the data sets to the adjacency list format. Note that the data set formatting is time-consuming, especially for large data sets. However, we did not include this time in the total execution time of G-thinker. In addition, to ensure a fair comparison between the three systems, we executed G-thinker with 4 threads since we have 4 CPU cores on each machine.
Table 8. Comparison of the Adaptive Algorithm with SQL Queries against Spark GraphX and G-thinker in terms of Running Time (in seconds) and Triangle Count (TC) on 4 machines (c = 2 and k = 4)
Data sets | GraphX | | G-thinker | | Adaptive with SQL | |
|---|---|---|---|---|---|---|
 | Time | TC | Time | TC | Time | TC |
Hyves | 119 | 752,401 | 23 | 538,362 | 22 | 752,401 |
Youtube | 112 | 3,056,386 | 30 | 3,056,386 | 16 | 3,056,386 |
as-Skitter | Crash | na | 101 | 28,769,868 | 55 | 28,769,868 |
Flickr-links | Crash | na | 92 | 548,174,465 | 224 | 548,174,465 |
Live-journal | Crash | na | 91 | 177,820,130 | 588 | 177,820,130 |
DBpedia | Crash | na | 28,579 | 154,277,630 | 7,689 | 328,743,411 |
Dimacs10_uk | Crash | na | Crash | na | 5,229 | 4,451,687,605 |
For the small to medium data sets (i.e. Hyves, Youtube, and as-Skitter), the adaptive algorithm is more efficient, with a better execution time. Note that for the Hyves data set, G-thinker outputs fewer triangles since it is a very skewed data set (d_max = 31,883), and Spark GraphX gives the worst execution time. On the other hand, G-thinker gives the best running time on larger data sets such as Flickr-links and Live-journal. For these data sets, G-thinker benefits from its partitioning strategy on HDFS to efficiently count the triangles, whereas Spark GraphX fails because of a lack of memory. Finally, G-thinker gives a very long runtime for the DBpedia data set and crashes for the Dimacs10_uk data set because of a memory issue (segmentation error). Note that G-thinker did not count all the triangles for DBpedia. On the other hand, the adaptive algorithm finishes the triangle enumeration in about two hours on a modest cluster. In addition, the three systems count triangles efficiently, but only the adaptive algorithm outputs the list of triangles besides the total triangle count. In fact, Spark GraphX, for example, is limited to triangle counting (no enumeration). Its triangle counting function increments the count of triangles whenever there are three adjacent vertices having edges between them, regardless of their directions. To count the triangles, Spark creates jobs on the master node and sends them to workers for execution. The workers use containers to read data, then produce results and send them to the master node, which collects all the workers' results to display them. So the master node waits for all workers to finish before producing the final results. All these intermediate steps consume time and space. Besides, Spark relies on the ecosystem configuration; it is more suitable for HPC, where each node has enough memory and processing capacity. This is one reason for the failure of Spark GraphX on large graphs like as-Skitter and Flickr-links, and for its long running time on small graphs like Youtube, given our configuration. To sum up, the adaptive algorithm executes excellently on a modest cluster and gives the exact triangle count for all the data sets regardless of their size or skewness.
Evaluation of adaptive algorithm with Python and MPI
In this section, we evaluate our solution programmed with Python and MPI in terms of speedup, memory usage, and load balancing. We also compare it against competing graph engines and position it among them.
Algorithm speedup
To study the speedup of our algorithm programmed with Python and MPI, we provide Table 9, which compares the triangle enumeration (counting) running times as the cluster size varies from 4 to 16 machines. The execution on one machine is excluded because all the data sets fail on it.
Table 9. Speedup and Memory Peaks of the Adaptive Algorithm with Python and MPI (time in seconds, memory peaks in MB)
| Data sets | 4 Machines Time | 4 Machines Peak | 9 Machines Time | 9 Machines Peak | 16 Machines Time | 16 Machines Peak | S(4→9) | S(4→16) |
|---|---|---|---|---|---|---|---|---|
| Hyves | 39 | 27 MB | 23 | 20 MB | 15 | 19 MB | 1.6 | 2.6 |
| Youtube | 23 | 43 MB | 16 | 18 MB | 11 | 20 MB | 1.4 | 2 |
| as-Skitter | Crash | na | 88 | 24 MB | 54 | 20 MB | na | 1.6 |
| Flickr-links | Crash | na | Crash | na | Crash | na | na | na |
| Live-journal | Crash | na | Crash | na | 541 | na | na | na |
| DBpedia | Crash | na | Crash | na | Crash | na | na | na |
| Dimacs10_uk | Crash | na | Crash | na | Crash | na | na | na |
S(4→9): the speedup from 4 to 9 machines; S(4→16): the speedup from 4 to 16 machines
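For clarity, the speedup columns are the usual running-time ratios between cluster sizes; for Hyves, for example:

$$S_{4\to 9} = \frac{T_4}{T_9} = \frac{39}{23} \approx 1.6, \qquad S_{4\to 16} = \frac{T_4}{T_{16}} = \frac{39}{15} = 2.6.$$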
MPI requires a large memory to give its best performance; therefore, the adaptive algorithm with Python fails on the 4-machine cluster for large data sets such as as-Skitter and Live-journal, and gives better results on the 16-machine cluster. We can thus deduce that increasing the number of colors used to partition the vertex set considerably reduces the execution time. Moreover, the speedup of the adaptive algorithm is significant from the smallest cluster to the largest. We believe that with larger memory and a more sophisticated infrastructure, the adaptive algorithm with Python and MPI would give even better results. To sum up, the adaptive algorithm with Python is suitable for small data sets on any cluster and more efficient for larger data sets on sophisticated clusters.
In addition, Table 9 presents the memory peaks of our algorithm's execution. The message passing managed by MPI and the merge operations of Python's Pandas require a considerable memory budget. As a result, the memory peaks are low for small data sets, but as the graph grows, these operations require a larger memory space. This is why our algorithm with Python and MPI crashes on large data sets.
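To make this memory cost concrete, the local enumeration step can be expressed with Pandas merges along the following lines (a simplified sketch with hypothetical column names, not our full implementation); materializing the intermediate wedge dataframe is what drives the memory peaks:

```python
import pandas as pd

def enumerate_triangles(edges: pd.DataFrame) -> pd.DataFrame:
    """Enumerate triangles in an oriented edge dataframe with columns
    (src, dst), where src < dst so each triangle is produced once."""
    # Build wedges (a, b, c): join edge (a, b) with edge (b, c).
    wedges = edges.merge(edges, left_on="dst", right_on="src",
                         suffixes=("_ab", "_bc"))
    # Close each wedge with the edge (a, c); this intermediate result
    # can be far larger than the input, hence the memory pressure.
    tri = wedges.merge(edges, left_on=["src_ab", "dst_bc"],
                       right_on=["src", "dst"])
    return tri[["src_ab", "dst_ab", "dst_bc"]].rename(
        columns={"src_ab": "a", "dst_ab": "b", "dst_bc": "c"})
```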
Balanced load
As mentioned previously, the main added value of our algorithm is that it ensures perfect load balancing. Therefore, we provide Fig. 13, which depicts the load balancing achieved by our algorithm programmed with Python and MPI on the different clusters. We present only the data sets whose execution succeeded and discard the others.
[Fig. 13: Triangle Count per Machine: Balanced Load (see PDF for image)]
As the histograms show, each machine outputs an equal share of the total triangles. This is due to the fact that our partitioning strategy redistributes the edges in such a way that each machine acquires its respective edges in a local dataframe and performs the triangle enumeration locally according to its color triplets.
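A condensed sketch of this redistribution step with mpi4py follows; the coloring and the owner mapping here are simplified placeholders, not the exact color-triplet assignment of our algorithm:

```python
from mpi4py import MPI
import pandas as pd

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

def color(v: int, k: int) -> int:
    # Placeholder: any deterministic map of vertices to k colors works here.
    return v % k

def redistribute(edges: pd.DataFrame, k: int) -> pd.DataFrame:
    """Route each edge to the machine owning its color pair so that
    every machine can enumerate its triangles locally."""
    owner = (edges["src"].map(lambda v: color(v, k)) * k
             + edges["dst"].map(lambda v: color(v, k))) % p
    outgoing = [edges[owner == m] for m in range(p)]  # one bucket per machine
    incoming = comm.alltoall(outgoing)                # all-to-all bucket exchange
    return pd.concat(incoming, ignore_index=True)
```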
Efficiency on a parallel cluster
In this section, we compare the adaptive algorithm programmed with Python and MPI against the adaptive algorithm with SQL queries; then a comparison with Spark GraphX is conducted. Our choice of Spark GraphX is justified by the fact that both Spark GraphX and our solution with Python and MPI depend on the memory size, unlike G-thinker, which can work efficiently with a small memory budget.
Comparison of adaptive algorithm with Python and MPI against adaptive algorithm with SQL queries
We summarize in Table 10 the execution time and the triangle count of both algorithms on the different clusters.
Table 10. Adaptive Algorithm with Python and MPI against Adaptive Algorithm with SQL Queries (running time in seconds)
| Data sets | Python+MPI, 4 Machines | Python+MPI, 9 Machines | Python+MPI, 16 Machines | SQL, 4 Machines | SQL, 9 Machines | SQL, 16 Machines |
|---|---|---|---|---|---|---|
| Hyves | 39 | 23 | 15 | 22 | 16 | 12 |
| Youtube | 23 | 16 | 11 | 16 | 13 | 9 |
| as-Skitter | Crash | 88 | 54 | 55 | 47 | 30 |
| Flickr-links | Crash | Crash | Crash | 224 | 160 | 81 |
| Live-journal | Crash | Crash | 541 | 588 | 187 | 89 |
| DBpedia | Crash | Crash | Crash | 7689 | 4026 | 3760 |
| Dimacs10_uk | Crash | Crash | Crash | 5229 | 3962 | 3102 |
The adaptive algorithm programmed with Python and MPI gives results equivalent to the adaptive algorithm with SQL queries. However, MPI depends mainly on the main memory size, which is why it crashes on the large data sets: it cannot allocate enough memory. The adaptive algorithm programmed with Python and MPI becomes more efficient as the cluster size increases and the aggregate main memory becomes large enough to process the entire graph data set. On the other hand, the adaptive algorithm with SQL queries completes its execution on both small and large data sets, thanks to the database system's efficient main-memory execution and its optimized I/O techniques when the data exceeds the main memory size.
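For reference, the core of the local enumeration in SQL is a three-way self-join; the following is a minimal sketch assuming a hypothetical oriented edge table E(src, dst) with src < dst and a PostgreSQL-compatible endpoint (our actual queries add the partitioning step and the optimizations of Sec. 3.4):

```python
import psycopg2  # assumption: a PostgreSQL-compatible DBMS endpoint

# With oriented edges (src < dst), every triangle a < b < c is produced
# exactly once from its three edges (a,b), (b,c), and (a,c).
TRIANGLE_SQL = """
SELECT e1.src AS a, e1.dst AS b, e2.dst AS c
FROM   E e1
JOIN   E e2 ON e2.src = e1.dst
JOIN   E e3 ON e3.src = e1.src AND e3.dst = e2.dst;
"""

def enumerate_triangles(dsn: str) -> list:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TRIANGLE_SQL)
        return cur.fetchall()  # list of (a, b, c) triangle tuples
```

Because the DBMS evaluates the joins with its own buffer management, the query degrades gracefully to disk-based execution rather than crashing when intermediate results exceed RAM.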
Comparison of adaptive algorithm with Python and MPI against Spark GraphX
Spark GraphX is a Pregel-like system: like the adaptive algorithm with Python and MPI, it requires a large memory to hold and process the graph data set. Indeed, as explained previously, MPI becomes more efficient with a large main memory, which allows messages to be sent and received quickly. We believe that comparing the adaptive algorithm with Python and MPI against Spark GraphX is therefore fair and positions our solution among the others. Table 11 summarizes this comparison in terms of running time (in seconds) and triangle count (TC) on 16 machines. We chose 16 machines in order to compare the two systems on the largest memory space available to us and track their behavior.
Table 11. Comparison of Adaptive Algorithm with Python and MPI against Spark GraphX in terms of Running Time (in seconds) and Triangle Count (TC) on 16 machines
| Data sets | Spark GraphX Time | Spark GraphX TC | Python+MPI Time | Python+MPI TC |
|---|---|---|---|---|
| Hyves | 92 | 752,401 | 15 | 752,401 |
| Youtube | 90 | 3,056,386 | 11 | 3,056,386 |
| as-Skitter | Crash | na | 54 | 28,769,868 |
| Flickr-links | Crash | na | Crash | na |
| Live-journal | Crash | na | 541 | 177,820,130 |
| DBpedia | Crash | na | Crash | na |
| Dimacs10_uk | Crash | na | Crash | na |
Table 11 shows that our algorithm gives a better execution time on small data sets and succeeds on the Live-journal data set, where Spark fails. On the other hand, both systems fail on the large and very large data sets due to lack of memory. Our algorithm gives a good execution time compared to Spark GraphX, but it remains dependent on the main memory size and is not as efficient as the adaptive algorithm with SQL queries on modest clusters.
Pros and cons
Through these experiments, we learned the following:
The adaptive algorithm preserves perfect load balancing regardless of the size and the type (sparse or skewed) of the graph: each machine in the cluster outputs an equal share of the total triangle count.
The adaptive algorithm presents an excellent speedup: as the cluster size increases, the running time decreases. This is due to the edge re-balancing and shuffling before triangle enumeration (counting).
The load balancing and the speedup of our algorithm ensure its scalability. As the graph grows, the adaptive algorithm implemented with SQL queries still gives correct results with competitive runtime, even when running on the smallest cluster.
The adaptive algorithm with SQL queries allocates only the necessary memory when executing the queries, without exceeding the available memory. It never fails due to memory limitations, because the DBMS's memory allocation and transaction management processes adapt query execution to the available memory and processing resources while applying optimized I/O techniques.
The adaptive algorithm with SQL queries is suitable for skewed graphs regardless of their sizes. The partitioning strategy ensures a balanced workload when sending edges to proxies. Hence, each machine handles the same number of triangles when performing the local enumeration.
On a modest hardware configuration, the adaptive algorithm with SQL queries is more efficient than G-thinker, particularly on small to medium data sets and on very large graphs.
The adaptive algorithm with Python and MPI is more suitable for medium-sized clusters, which keep the communication cost moderate. It can also be applied to sparse graphs that do not require much memory, so it can easily be used in fields such as biology and CAD, where triangle enumeration is only one step of a larger algorithm. On the other hand, the adaptive algorithm with SQL queries works well for dense graphs with skewed vertex degrees on clusters of any size, so it is convenient for organizations with modest hardware setups and more complicated graphs to explore.
Related work
The foundational algorithms for enumerating triangles are the node iterator [23] and the edge iterator [24], which share the same asymptotic behavior [23]. With the growth of graph sizes, many parallel and distributed works based on these algorithms have been introduced. Arifuzzaman et al. [10] proposed an MPI-based distributed-memory parallel algorithm implemented in C++; it is a distributed version of the node iterator for counting and enumerating triangles in massive networks. Rasel et al. [12] presented an index-based method for listing triangles in massive graphs, which accesses the indexed data asynchronously and joins them to list the triangles using a multithreaded parallel processing technique. Ayed, Hacid, Haque, and Jemai [25] studied 6 algorithms for Frequent Subgraph Mining (FSM), one of which relies on a triangle matrix to identify the subgraphs. Yu, Qin, Zhang, Zhang, and Lin [26] introduced an improved edge-iterator-based algorithm that uses an adaptive orientation technique (AOT) based on the out-degree of the vertices; AOT refines the orientation technique, which maps an undirected graph to a directed acyclic graph. The same authors [27] developed a parallel solution for triangle counting on batch-dynamic graphs, whose main idea is to report the updated triangles (deleted and new triangles) resulting from a batch of updates.
In addition, much research was conducted in the MapReduce framework [17, 28–30]. These solutions are time-costly because their processing is parallelized through several rounds of MapReduce. Hence, some solutions were implemented with Spark to improve these algorithms, such as Wang Zhuo [31], who proposed a solution to directly enumerate the triangle structures of candidate vertices, and Park et al. [13], who developed PTE, a distributed algorithm for enumerating triangles in enormous graphs that resolves the structural inefficiency of the previous MapReduce algorithms.
On the other hand, many solutions are based on the vertex-centric model pioneered by Google's Pregel [32]. These systems follow a think-like-a-vertex programming model, in which vertices exchange messages along edges to update their states. Following this model, systems such as Apache Giraph, Spark GraphX [33], GraphLab [34], and PowerGraph [35] have been proposed. Although Pregel provides a high-level distributed programming abstraction, it suffers from efficiency issues, including the overhead of global synchronization, a large volume of messages, unbalanced workloads, and straggler problems due to slower machines [36]; it is also inappropriate for subgraph mining algorithms. To address these limitations, some systems adopted a think-like-a-subgraph model, in which communication is performed at the level of subgraphs instead of vertices, offering lower communication and memory overhead than the vertex-centric model. NScale [37], Arabesque [11], G-Miner [38], Fractal [39], and DISC [40] follow this programming model. However, these systems are I/O-bound and may fail due to memory overhead. Yan et al. [41] presented G-thinker, a subgraph-centric system with more efficient memory usage.
G-thinker was proposed to solve the problems of earlier subgraph-centric systems by presenting a task-based execution engine that can work with a small memory budget while keeping the CPU cores fully utilized. It addresses different subgraph finding problems, including triangle counting. It is built on top of a Hadoop cluster and uses MPI and threading. According to the authors, G-thinker uses all the CPU cores of the cluster by exploiting a vertex cache and a lightweight task scheduling approach. However, the system is highly dependent on its configuration and requires significant expertise to set up its dependencies.
Finally, we discuss closely related work by the authors of this DAPD journal paper. The initial conference version [42] proposed SQL queries as an alternative programming mechanism for the theoretical version of the algorithm, originally introduced in [43, 44]. We must mention that the theory paper [44] proved via theorems that the adaptive rebalancing algorithm for enumerating triangles has optimal speedup as the number of machines grows, but it did not provide experimental validation on a real parallel cluster. That is, our conference version represents a first step towards enumerating triangles on a parallel DBMS with SQL queries. Our journal version incorporates many theoretical and experimental additions beyond the conference version: (1) we derive a smaller required number of machines, which results in a more practical solution (Sec. 3.3), (2) we introduce new query optimizations applicable to any DBMS (row or columnar) in Sec. 3.4, (3) we introduce new optimizations specific to columnar DBMSs in Sec. 3.4, (4) we provide a detailed MPI implementation to contrast a database solution with an HPC solution in Sec. 3.5, and (5) we provide extensive experiments comparing the SQL solution with alternative graph analytic systems (traditional and vertex-centric) in Sec. 5.3.3.
Conclusions
We presented an adaptive algorithm with SQL queries for triangle enumeration (counting) on large graphs. Our approach provides an elegant, short, and abstract solution that can be easily integrated into a data analysis pipeline. Our main aim is to introduce a distributed algorithm that can work inside a DBMS, with big data systems and data analysis tools. We presented an intelligent vertex partitioning strategy that distributes the graph over the cluster machines in such a way that data exchange during the enumeration task is eliminated: it sends edges to their respective machines without keeping multiple copies of them, while employing the necessary optimizations, and all the triangles are then extracted locally in each sub-graph. We proved that our approach scales well, especially with skewed graphs, and our partitioning strategy ensures a balanced data distribution. Our experiments showed that our adaptive algorithm with SQL queries presents an excellent speedup and is efficient on skewed graphs compared to the classic algorithm, Spark GraphX, and G-thinker on a very modest cluster. In addition, we presented an adaptive algorithm implemented with Pandas and MPI as an alternative to our solution with SQL queries; this implementation proved its efficiency on small to medium data sets and on large clusters. In summary, our experimental study has shown that our adaptive algorithm can be as efficient as dedicated graph analysis engines.
For future work, we plan to study query processing trade-offs on the cloud and on big data systems like Spark SQL. In addition, we want to experimentally evaluate our adaptive algorithm with multi-core processing (one machine with k cores). Finally, we intend to extend our algorithm to the clique detection problem, another computationally challenging graph problem.
Author contributions
Conceptualization: AF, CO; Methodology: CO; Formal analysis and investigation: AF, XZ; Writing - original draft preparation: AF, XZ; Writing - review and editing: AF, XZ, CO, LB; Funding acquisition: MM; Resources: LB; Supervision: CO, LB, MM.
Funding
The authors did not receive support from any organization for the submitted work.
Declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Wimmer, A; Lewis, K. Beyond and below racial homophily: ERG models of a friendship network documented on Facebook. Am. J. Sociol.; 2010; 116,
2. Wasserman, S; Faust, K. Social network analysis: methods and applications; 1994; Cambridge: Cambridge University Press. [DOI: https://dx.doi.org/10.1017/CBO9780511815478]
3. Kolountzakis, MN; Miller, GL; Peng, R; Tsourakakis, CE. Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math.; 2012; 8,
4. Eckmann, J-P; Moses, E. Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc. Natl. Acad. Sci.; 2002; 99,
5. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD, pp. 16-24 (2008). https://doi.org/10.1145/1401890.1401898
6. Welser, HT; Gleave, E; Fisher, D; Smith, M. Visualizing the signatures of social roles in online discussion groups. J. Soc. Struct.; 2007; 8,
7. Mirza, BJ; Keller, BJ; Ramakrishnan, N. Studying recommendation algorithms by graph analysis. J. Intell. Inf. Syst.; 2003; 20,
8. Fudos, I; Hoffmann, CM. A graph-constructive approach to solving systems of geometric constraints. ACM Trans. Graph.; 1997; 16,
9. Berry, JW; Fostvedt, LA; Nordman, DJ; Phillips, CA; Seshadhri, C; Wilson, AG. Why do simple algorithms for triangle enumeration work in the real world?. Internet Math.; 2015; 11,
10. Arifuzzaman, S; Khan, M; Marathe, M. A space-efficient parallel algorithm for counting exact triangles in massive networks. HPCC; 2015; [DOI: https://dx.doi.org/10.1109/HPCC-CSS-ICESS.2015.301]
11. Teixeira, C.H.C., Fonseca, A.J., Serafini, M., Siganos, G., Zaki, M.J., Aboulnaga, A.: Arabesque: A system for distributed graph mining. In: Proceedings of the 25th Symposium on Operating Systems Principles pp. 425-440 (2015). New York: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/2815400.2815410
12. Rasel, MK; Han, Y; Kim, J; Park, K; Tu, NA; Lee, Y-K. Itri: index-based triangle listing in massive graphs. Inf. Sci.; 2016; 336, pp. 1-20. [DOI: https://dx.doi.org/10.1016/j.ins.2015.12.006]
13. Park, H-M; Myaeng, S-H; Kang, U. PTE: enumerating trillion triangles on distributed systems. ACM SIGKDD; 2016; [DOI: https://dx.doi.org/10.1145/2939672.2939757]
14. Zhu, Y; Zhang, H; Qin, L; Cheng, H. Efficient MapReduce algorithms for triangle listing in billion-scale graphs. DAPD J.; 2017; 35,
15. Klauck, H., Nanongkai, D., Pandurangan, G., Robinson, P.: Distributed computation of large-scale graph problems. In: Proceedings of the 26th ACM-SIAM SODA, pp. 391-410 (2015)
16. Al Hasan, M; Dave, VS. Triangle counting in large networks: a review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov.; 2018; 8,
17. Cohen, J. Graph twiddling in a MapReduce world. Comput. Sci. Eng.; 2009; 11,
18. Xirogiannopoulos, K; Khurana, U; Deshpande, A. Graphgen: exploring interesting graphs in relational data. Proc. VLDB Endow.; 2015; 8,
19. Bordoloi, S; Kalita, B. Article: designing graph database models from existing relational databases. Intern. J. Comput. Appl.; 2013; 74,
20. De Virgilio, R; Maccioni, A; Torlone, R. Converting relational to graph databases. GRADES 2013 co-located with SIGMOD/PODS; 2013; [DOI: https://dx.doi.org/10.1145/2484425.2484426]
21. Leskovec, J., Krevl, A.: SNAP Datasets: stanford large network dataset collection. http://snap.stanford.edu/data (2014)
22. Kunegis, J.: Konect: The koblenz network collection. http://konect.uni-koblenz.de. Association for Computing Machinery (2013)
23. Schank, T; Wagner, D. Finding, counting and listing all triangles in large graphs, an experimental study. Exp. Effic. Algorithms; 2005; [DOI: https://dx.doi.org/10.1007/11427186_54]
24. Itai, A; Rodeh, M. Finding a minimum circuit in a graph. SIAM J. Comput.; 1978; 7,
25. Ayed, R; Hacid, M; Haque, R; Jemai, A. An updated dashboard of complete search FSM implementations in centralized graph transaction databases. J. Intell. Inf. Syst.; 2020; 55,
26. Yu, M., Qin, L., Zhang, Y., Zhang, W., Lin, X.: AOT: Pushing the efficiency boundary of main-memory triangle listing. In: Nah, Y., Cui, B., Lee, S.-W., Yu, J.X., Moon, Y.-S., Whang, S.E. (Eds.), Database Systems for Advanced Applications, pp. 516-533. Cham: Springer International Publishing (2020)
27. Yu, M; Qin, L; Zhang, Y; Zhang, W; Lin, X. Dptl+: efficient parallel triangle listing on batch-dynamic graphs. IEEE ICDE; 2021; [DOI: https://dx.doi.org/10.1109/ICDE51399.2021.00119]
28. Afrati, FN; Sarma, AD; Salihoglu, S; Ullman, JD. Upper and lower bounds on the cost of a MapReduce computation. PVLDB; 2013; 6,
29. Park, H-M; Chung, C-W. An efficient MapReduce algorithm for counting triangles in a very large graph. ACM CIKM; 2013; [DOI: https://dx.doi.org/10.1145/2505515.2505563]
30. Zhu, Y; Zhang, H; Qin, L; Cheng, H. Efficient MapReduce algorithms for triangle listing in billion-scale graphs. DAPD; 2017; 35,
31. Zhuo, W; Pan, B; Bo, S. Parallel algorithm for triangle enumeration. J. Comput. Appl.; 2017; 37,
32. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135-146 (2010). New York: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/1807167.1807184
33. Xin, RS; Gonzalez, JE; Franklin, MJ; Stoica, I. GraphX: a resilient distributed graph system on Spark. First International Workshop on Graph Data Management Experiences and Systems; 2013; New York, Association for Computing Machinery: [DOI: https://dx.doi.org/10.1145/2484425.2484427]
34. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: GraphLab: A new framework for parallel machine learning. In: Grünwald, P., Spirtes, P. (Eds.), UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, July 8-11, 2010, pp. 340-349. AUAI Press (2010)
35. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: Distributed graph-parallel computation on natural graphs. In: Thekkath, C., Vahdat, A. (Eds.), 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pp. 17-30 (2012). USENIX Association. Retrieved from https://www.usenix.org/conference/osdi12/technicalsessions/presentation/gonzalez
36. Yan, D., Bu, Y., Tian, Y., Deshpande, A., Cheng, J.: Big graph analytics systems. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2241-2243 (2016). New York: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/2882903.2912566
37. Quamar, A; Deshpande, A; Lin, J. NScale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J.; 2016; 25,
38. Chen, H., Liu, M., Zhao, Y., Yan, X., Yan, D., Cheng, J.: G-Miner: an efficient task-oriented graph mining system. In: Oliveira, R., Felber, P., Hu, Y.C. (Eds.), Proceedings of the 13th EuroSys Conference, pp. 1-12. ACM (2018). https://doi.org/10.1145/3190508.3190545
39. dos Santos Dias, V.V., Teixeira, C.H.C., Guedes, D.O., Meira, W., Jr., Parthasarathy, S.: Fractal: A general-purpose graph pattern mining system. SIGMOD, pp. 1357-1374. ACM (2019). https://doi.org/10.1145/3299869.3319875
40. Zhang, H; Yu, JX; Zhang, Y; Zhao, K; Cheng, H. Distributed subgraph counting: a general approach. Proc. VLDB Endow.; 2020; 13,
41. Yan, D., Guo, G., Rahman Chowdhury, M.M., Tamer Özsu, M., Ku, W.-S., Lui, J.C.S.: G-thinker: A distributed framework for mining subgraphs in a big graph. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1369-1380 (2020). https://doi.org/10.1109/ICDE48307.2020.00122
42. Farouzi, A; Bellatreche, L; Ordonez, C; Pandurangan, G; Malki, M. A scalable randomized algorithm for triangle enumeration on graph based on SQL queries. DaWaK Conf.; 2020; [DOI: https://dx.doi.org/10.1007/978-3-030-59065-9_12]
43. Pandurangan, G; Robinson, P; Scquizzato, M. Fast distributed algorithms for connectivity and MST in large graphs. ACM Trans. Parallel Comput.; 2018; 5,
44. Pandurangan, G; Robinson, P; Scquizzato, M. On the distributed complexity of large-scale graph computations. SPAA; 2018; [DOI: https://dx.doi.org/10.1145/3210377.3210409]
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023.