Keywords: data mining, clustering, overlapping clustering, GPU computing
Received: November 18, 2016
Pattern Recognition and Data Mining pose several problems in which, by their inherent nature, an object is considered to belong to more than one class; that is, clusters can overlap each other. OClustR and DClustR are overlapping clustering algorithms that have shown, in the task of document clustering, the best tradeoff between cluster quality and efficiency among the existing overlapping clustering algorithms. Despite the good results attained by both algorithms, they are O(n²), so they could be less useful in applications dealing with a large number of documents. Moreover, although DClustR can efficiently process changes in an already clustered collection, the amount of memory it uses could make it unsuitable for applications dealing with very large document collections. In this paper, two GPU-based parallel algorithms, named CUDA-OClus and CUDA-DClus, are proposed in order to enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. The experimental evaluation conducted over several standard document collections showed the correctness of both CUDA-OClus and CUDA-DClus, as well as their better performance in terms of efficiency and memory consumption.
Povzetek (translated from Slovenian): OClustR and DClustR are overlapping clustering algorithms that achieve good results, but their complexity is of quadratic order. This paper presents two GPU-based parallel algorithms, CUDA-OClus and CUDA-DClus. In experiments, they demonstrated the ability to work with large amounts of data.
1 Introduction
Clustering is a technique of Machine Learning and Data Mining that has been widely used in several contexts [1]. This technique aims to structure a data set in clusters or classes such that objects belonging to the same class are more similar than objects belonging to different classes [2].
There are several problems that, by their inherent nature, consider that objects could belong to more than one class [3,4,5]; that is, clusters can overlap each other. Most of the clustering algorithms developed so far do not consider that clusters could share elements; however, the desire to adequately target applications dealing with this problem has recently favored the development of overlapping clustering algorithms, i.e., algorithms that allow objects to belong to more than one cluster. An overlapping clustering algorithm that has shown, in the task of document clustering, the best tradeoff between cluster quality and efficiency among the existing overlapping clustering algorithms is OClustR [6]. Despite the good results attained by OClustR in the task of document clustering, it has two main limitations:
1. It has a computational complexity of O(n²), so it could be less useful in applications dealing with a large number of documents.
2. It assumes that the entire collection is available before clustering. Thus, when this collection changes it needs to rebuild the clusters starting from scratch; that is, OClustR does not use the previously built clustering for updating the clusters after changes.
In order to overcome the second limitation, the DClustR algorithm was proposed by Pérez-Suárez et al. in [7]. DClustR introduced a strategy for efficiently updating the clustering after multiple additions and/or deletions from the collection, making it suitable for handling overlapping clustering in applications where the collection changes frequently, especially for those applications handling multiple changes at the same time. Nevertheless, DClustR still suffers from the first limitation; that is, like OClustR, it is O(n²). This implies that when the collection grows considerably, the time DClustR needs for processing the changes could make it less useful in real applications. Moreover, when the collection grows, the memory space used by DClustR for storing the data it needs also grows, making DClustR a high memory consumer and, consequently, making it unsuitable for applications dealing with large collections. Motivated by the above facts, in this work we extend both OClustR and DClustR for efficiently processing very large document collections.
A technique that has been widely used in recent years to speed up computing tasks is parallel computing and, specifically, GPU computing. A GPU is a device initially designed for graphics processing but, due to its low cost, high level of parallelism and optimized floating-point operations, it has been used in many real applications dealing with large amounts of data.
The main contribution of this paper is the proposal of two GPU-based parallel algorithms, namely CUDA-OClus and CUDA-DClus, which enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents, such as news analysis, information organization and profile identification, among others.
Preliminary results of this paper were published in [8]. The main differences of this paper with respect to the conference paper presented in [8] are the following: (1) we introduce a new GPU-based algorithm, named CUDA-DClus, which is a parallel version of the DClustR algorithm, able to efficiently process changes in an already clustered collection and to efficiently process large collections of documents, and (2) we introduce a strategy for incrementally building and updating the connected components present in a graph, allowing CUDA-DClus to minimize the memory needed for processing the whole collection. It is important to highlight that in CUDA-DClus we only analyze additions of objects to the collection, because this is the case in which it could be difficult to apply DClustR in real applications dealing with large collections, since this is the case that makes the collection grow.
The remainder of this paper is organized as follows: in Section 2, a brief description of both the OClustR and DClustR algorithms is presented. In Section 3, the CUDA-OClus and CUDA-DClus parallel clustering algorithms are proposed. An experimental evaluation, showing the performance of both proposed algorithms on several document collections, is presented in Section 4. Finally, the conclusions as well as some ideas about future directions are presented in Section 5.
2 OClustR and DClustR algorithms
In this section, both the OClustR [6] and DClustR [7] algorithms are described. Since DClustR is the extension of OClustR for efficiently processing collections that can change due to additions, deletions and modifications, OClustR is first introduced and then the strategy used by DClustR for updating the clustering after changes is presented. All the definitions and examples presented in this section were taken from [6, 7].
2.1 The OClustR algorithm
In order to build a set of overlapping clusters from a collection of objects, OClustR employs a strategy comprised of three stages. In the first stage, the collection of objects is represented by OClustR as a weighted thresholded similarity graph. Afterwards, in the second stage, an initial set of clusters is built through a cover of the graph representing the collection, using a special kind of subgraph. Finally, in the third stage the final set of overlapping clusters is obtained by improving the initial set of clusters. Following, each stage is briefly described.
Let O = {o1, o2, ..., on} be a collection of objects, β ∈ [0,1] a similarity threshold, and S: O × O → ℝ a symmetric similarity function. A weighted thresholded similarity graph, denoted as Gβ = (V, Eβ, S), is an undirected and weighted graph such that V = O and there is an edge (v, u) ∈ Eβ iff S(v, u) ≥ β; each edge (v, u) ∈ Eβ, v ≠ u, is labeled with the value of S(v, u). As can be inferred, in the first stage OClustR must compute the similarity between each pair of objects; thus, the computational complexity of this stage is O(n²).
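The first stage described above can be sketched as follows. This is a hypothetical sequential sketch (the names build_graph and sim are ours, not the authors'); it only illustrates the O(n²) pairwise pass and the thresholding rule:

```python
def build_graph(objects, sim, beta):
    """Build G_beta as an adjacency map: edge (i, j) labeled S(i, j) iff S(i, j) >= beta."""
    n = len(objects)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):          # S is symmetric: visit each pair once
            s = sim(objects[i], objects[j])
            if s >= beta:
                adj[i][j] = s              # edge labeled with S(v, u)
                adj[j][i] = s
    return adj
```

The nested loop over all pairs is exactly the part that CUDA-OClus later parallelizes.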
Once Gβ is built, in the second stage OClustR builds an initial set of clusters through a covering of Gβ, using weighted star-shaped sub-graphs.
Let Gβ = (V, Eβ, S) be a weighted thresholded similarity graph. A weighted star-shaped sub-graph (ws-graph) in Gβ, denoted by G* = (V*, E*, S), is a sub-graph of Gβ having a vertex c ∈ V*, called the center of G*, such that there is an edge between c and every other vertex in V* \ {c}; these vertices are called satellites. All vertices in Gβ having no adjacent vertices (i.e., isolated vertices) are considered degenerated ws-graphs.
For building a covering of Gβ using ws-graphs, OClustR must build a set W = {G*1, G*2, ..., G*k} of ws-graphs of Gβ such that V = V*1 ∪ V*2 ∪ ... ∪ V*k, being V*i, ∀i = 1..k, the set of vertices of the ws-graph G*i. For solving this problem, OClustR searches for a list C = {c1, c2, ..., ck} such that ci ∈ C is the center of G*i ∈ W, ∀i = 1..k. In the following, we will say that a vertex v is covered if it belongs to C or if it is adjacent to a vertex that belongs to C. For pruning the search space and for establishing a criterion for selecting the vertices that should be included in C, the concept of relevance of a vertex is introduced.
The relevance of a vertex v, denoted as v.relevance, is defined as the average between the relative density and the relative compactness of a vertex v, denoted as v.densityR and v.compactnessR, respectively, which are defined as follows:
v.densityR = |{u ∈ v.Adj : |v.Adj| ≥ |u.Adj|}| / |v.Adj|

v.compactnessR = |{u ∈ v.Adj : AIS(G*v) ≥ AIS(G*u)}| / |v.Adj|

v.relevance = (v.densityR + v.compactnessR) / 2
where v.Adj and u.Adj are the sets of adjacent vertices of v and u, respectively; G*v and G*u are the ws-graphs determined by vertices v and u, and AIS(G*v) and AIS(G*u) are the approximated intra-cluster similarities of G*v and G*u, respectively. The approximated intra-cluster similarity of a ws-graph G*v is defined as the average weight of the edges existing in G*v between its center and its satellites.
Based on the above definitions, the strategy OClustR uses to build the list C is composed of three steps. First, a candidate list L containing the vertices having relevance greater than zero is created; isolated vertices are directly included in C. Then, L is sorted in decreasing order of relevance and each vertex v ∈ L is visited. If v is not covered yet, or it has at least one adjacent vertex that is not covered yet, then v is added to C. Each selected vertex, together with its adjacent vertices, constitutes a cluster in the initial set of clusters. The second stage of OClustR also has a computational complexity of O(n²). Figure 1 shows, through an example, the steps performed by OClustR in the second stage for building the initial set of clusters.
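The three steps above can be sketched in Python. This is an illustrative reconstruction (the names build_covering, adj and relevance are ours); adj maps each vertex to the set of its adjacent vertices:

```python
def build_covering(adj, relevance):
    """Select the centers C: isolated vertices first, then relevance-sorted candidates."""
    centers = [v for v in adj if not adj[v]]     # isolated vertices go directly into C
    candidates = sorted((v for v in adj if relevance[v] > 0),
                        key=lambda v: relevance[v], reverse=True)
    covered = set(centers)
    for v in candidates:
        # add v if it, or at least one of its adjacent vertices, is still uncovered
        if v not in covered or any(u not in covered for u in adj[v]):
            centers.append(v)
            covered.add(v)
            covered.update(adj[v])               # v now covers itself and its satellites
    return centers
```

Each returned center, together with its adjacent vertices, is one initial cluster.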
Finally, in the third stage, the final clusters are obtained through a process which aims to improve the initial clusters. With this aim, OClustR processes C in order to remove the vertices forming a non-useful ws-graph. A vertex v forms a non-useful ws-graph if: a) there is at least another vertex u ∈ C such that the ws-graph u determines includes v as a satellite, and b) the ws-graph determined by v shares more vertices with other existing ws-graphs than those it alone contains. For removing non-useful vertices, OClustR uses three steps. First, the vertices in C are sorted in descending order according to their number of adjacent vertices. After that, each vertex v ∈ C is visited in order to remove those non-useful ws-graphs determined by vertices in (v.Adj ∩ C). If a ws-graph G*u, with u ∈ (v.Adj ∩ C), is non-useful, u is removed from C and the satellites only it covers are "virtually linked" to v by adding them to a list named v.Linked; in this way, those vertices virtually linked to v will also belong to the ws-graph v determines. Once all vertices in (v.Adj ∩ C) are analyzed, v together with the vertices in v.Adj and v.Linked constitutes a final cluster. This third stage also has a computational complexity of O(n²). Figure 2 shows, through an example, how the final clusters are obtained from the initial clusters shown in Figure 1(d).
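A simplified sketch of this third stage follows. Note that it is our reconstruction, not the authors' code: the non-useful test below (comparing exclusively covered satellites against shared ones) is a simplification of conditions a) and b), and the names improve_clusters and linked are ours:

```python
def improve_clusters(adj, centers):
    """Drop centers whose ws-graphs are non-useful; virtually link their exclusive satellites."""
    centers = sorted(centers, key=lambda c: len(adj[c]), reverse=True)
    C = set(centers)
    linked = {c: set() for c in centers}         # v.Linked lists
    for v in centers:
        if v not in C:
            continue
        for u in sorted(adj[v] & C):             # candidate non-useful centers in v.Adj
            # satellites covered only by u (no other remaining center reaches them)
            exclusive = {w for w in adj[u]
                         if w not in C and not (adj[w] & (C - {u}))}
            shared = adj[u] - exclusive
            if len(shared) > len(exclusive):     # u's ws-graph is non-useful
                C.discard(u)
                linked[v] |= exclusive           # virtually link u's exclusive satellites to v
    return [{c} | adj[c] | linked[c] for c in centers if c in C]
```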
2.2 Updating the clusters after changes: the DClustR algorithm
Let Gβ = (V, Eβ, S) be the weighted thresholded similarity graph that represents an already clustered collection O. Let C = {c1, c2, ..., ck} be the set of vertices representing the current covering of Gβ and, consequently, the current clustering. When some vertices are added to and/or removed from O (i.e., from Gβ), the following two situations can occur:
1) Some vertices become uncovered. This situation occurs when at least one of the added vertices is uncovered or when those vertices of C covering a specific vertex were deleted from Gβ.
2) The relevance of some vertices changes and, as a consequence, at least one vertex u ∉ C appears such that u has a relevance greater than that of at least one vertex in C that covers vertices in u.Adj ∪ {u}. Vertices like u could determine ws-graphs with more satellites and less overlapping with other ws-graphs than ws-graphs currently belonging to the covering of Gβ.
Figure 3 illustrates the above situations over the graph Gβ of Figure 1(a). Figure 3(a) shows the graph Gβ before the changes; the vertices to be removed are marked with an "x". Figure 3(b) shows the graph Gβ after the changes; vertices filled with light gray represent the added vertices. Figures 3(c) and 3(d) show the updated graph Gβ with vertices labeled with letters and with their updated relevance values, respectively; vertices filled with black correspond to the vertices currently belonging to C. As can be seen from Figures 3(c) and 3(d), vertices S, F, G, I, H and J became uncovered after the changes, while vertex B, which does not belong to C, has a relevance greater than that of vertex D, which already belongs to C.
Taking into account the above situations, in order to update the clustering after changes DClustR first detects which connected components of Gβ were affected by the changes and then iteratively updates the covering of these components and, consequently, their clustering.
The connected components affected by changes are those that contain vertices that were added, or vertices that were adjacent to vertices that were deleted from Gβ. Since DClustR has control over these vertices, it can build these components through a depth-first search starting from any of them. Let G' = (V', E', S) be a connected component affected by changes, whose covering must be updated. Let C' ⊆ C be the set of vertices of G' which determine ws-graphs (i.e., clusters) covering G'. DClustR follows the same principles as OClustR; that is, it first builds or completes the covering of G' in order to build an initial set of clusters (stage 1) and then it improves these clusters in order to build the final set of clusters of G' (stage 2). In fact, DClustR uses the same steps as OClustR for the two above-mentioned stages but, unlike OClustR, it modifies the way in which the candidate list L, used in stage 1, is built.
In order to build the candidate list L, DClustR first recomputes the relevance value of all vertices in G' and empties the list c.Linked for all vertices c ∈ C'; this last action is supported by the fact that, after changes, ws-graphs that were considered non-useful may no longer be so. Let V+ ⊆ (V' \ C') be the set of vertices of G' with relevance greater than zero which do not belong to C'. For building the candidate list L, both C' and V+ are processed.
For processing V+, DClustR visits each vertex v ∈ V+ and verifies: a) if v is uncovered, or b) if at least one adjacent vertex of v is uncovered, or c) if there is at least one vertex u ∈ v.Adj such that no other vertex in C' covering u has a relevance greater than or equal to the relevance of v. If any of these three conditions is fulfilled, v is added to L. Additionally, if the last condition is fulfilled, all vertices like u are marked as "activated" in order to use them when C' is being processed. The computational complexity of processing V+ is O(n²).
For processing C', DClustR visits the adjacent vertices of each vertex v ∈ C'. Any vertex u ∈ v.Adj having a relevance greater than that of v is added to L; in these cases, v is additionally marked as "weak". Once all the adjacent vertices of v have been visited, if v was marked as "weak" or at least one of its adjacent vertices was previously marked as "activated", v is removed from C', since it could be substituted by a more relevant vertex. However, if v has a relevance greater than zero, it is still considered a candidate and, consequently, it is added to L. The computational complexity of processing C' is O(n²).
Figure 4 shows the updated set of overlapping clusters obtained by DClustR when it processes the graph in Figure 3(d); vertices filled with black represent the vertices determining ws-graphs that cover each connected component of Gβ.
Like OClustR, DClustR has a computational complexity of O(n²).
3 Proposed parallel algorithms
As mentioned in Section 1, despite the good results attained by OClustR and DClustR in the task of document clustering, these algorithms are O(n²), so they could be less useful in applications dealing with a very large number of documents. Motivated by this fact, in this section two massively parallel implementations in CUDA of OClustR and DClustR are proposed in order to enhance the efficiency of OClustR and DClustR in the above-mentioned problems. These parallel algorithms, namely CUDA-OClus and CUDA-DClus, take advantage of the benefits of GPUs, such as the high-bandwidth communication between CPU and GPU and the GPU memory hierarchy.
Although in their original articles both OClustR and DClustR were proposed as general-purpose clustering algorithms, the parallel extensions proposed in this work are specifically designed for processing documents. This application context is the same in which both OClustR and DClustR were evaluated, and it is also a context in which very large collections are commonly processed. In the context of document processing, both CUDA-OClus and CUDA-DClus use the cosine measure [9] for computing the similarity between two documents; this measure is the function that has been used the most for this purpose [10]. The cosine measure between two documents di and dj is defined as:
cos(di, dj) = ( Σk di(k) · dj(k) ) / ( ||di|| · ||dj|| )    (1)
where di(k) and dj(k) are the weights of the kth term in the descriptions of documents di and dj, respectively; ||di|| and ||dj|| are the norms of documents di and dj, respectively.
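Expression (1) can be computed over the sparse document representation used later in this section (parallel vectors of term positions and weights). The following is an illustrative sequential sketch (the name cosine is ours):

```python
import math

def cosine(Td_i, Wd_i, Td_j, Wd_j):
    """Cosine measure of Eq. (1) between two documents stored as
    parallel vectors: term positions (Td) and term weights (Wd)."""
    w_j = dict(zip(Td_j, Wd_j))                        # term -> weight lookup for d_j
    num = sum(w * w_j.get(t, 0.0) for t, w in zip(Td_i, Wd_i))  # only shared terms contribute
    norm_i = math.sqrt(sum(w * w for w in Wd_i))
    norm_j = math.sqrt(sum(w * w for w in Wd_j))
    return num / (norm_i * norm_j)
```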
In experiments conducted over several document collections, it was verified that the first stage of OClustR, the construction of the similarity graph, consumes 99% of the processing time of the algorithm. The remaining 1% is mainly dominated by the computation of the relevance of the vertices. Based on this fact, these two steps are the ones implemented in CUDA by CUDA-OClus; the remaining steps are high memory-consuming tasks that are better suited to a CPU implementation. Analogously, in these experiments it was also verified that the most time-consuming steps of DClustR are the updating of the graph after changes and the recomputation of the relevance, so these steps are implemented in CUDA by CUDA-DClus. In this case, it should also be noticed that the detection of the connected components affected by changes is a high memory-consuming task performed by DClustR, so it is also important to address this problem in CUDA-DClus.
Finally, it is also important to mention that, since we are dealing with the problem of processing very large document collections, CUDA-DClus only tackles additions, which are the changes that can increase the size of the collection. Implementing deletions is irrelevant for overcoming problems related to large document collections.
In the following, the CUDA-OClus algorithm is first introduced and then the CUDA-DClus algorithm is presented.
3.1 CUDA-OClus algorithm
Let D = {d1, d2, ..., dn} be a collection of documents described by a set of terms. Let T = {t1, t2, ..., tm} be the list containing all the different terms that describe at least one document in D. CUDA-OClus represents a document di ∈ D by two parallel vectors, denoted by Tdi and Wdi. The first one contains the positions in T of the terms describing di, and the second one contains the weights those terms have in the description of di.
For building Gβ = (V, Eβ, S), OClustR requires S to be a symmetric similarity measure, so the similarity between any two documents (i.e., vertices in Gβ) needs to be computed only once. Based on this fact, and considering the inherent order the documents have inside a collection D (i.e., vertices in V), for building the edges relative to a vertex v ∈ V it is only necessary to compute the similarity between v and each vertex following v in V. Let Sucv be the list of vertices that follow a vertex v in V. To speed up the construction of Gβ, for each vertex v ∈ V, CUDA-OClus computes in parallel the similarity between v and the vertices in Sucv.
Considering the definition of the cosine measure, it can be seen from Expression (1) that its numerator is a sum of independent products which can be computed all at once. On the other hand, taking into account that the norm of a document can be computed while the document is being read, the denominator of Expression (1) can also be resolved with no extra time. Based on these facts, CUDA-OClus also parallelizes the computation of the similarity between a pair of vertices, in order to speed up the construction of Gβ even more.
In order to carry out the previous idea, CUDA-OClus builds a grid comprised of k square blocks, each block having a shared-memory square matrix (SMM), where k = ⌊n/t⌋ + 1 and t is the dimension of both the blocks and the matrices. A grid is a logical representation of a matrix of threads in the GPU. The use of the SMM and its low latency allows CUDA-OClus to avoid constantly accessing CPU memory, speeding up the computation of the similarity between two vertices. CUDA-OClus assigns to t the maximum value allowed by the architecture of the GPU for the dimension of an SMM.
When CUDA-OClus is going to compute the similarity between a vertex v and the vertices in Sucv, it first builds a vector Pv of size m. This vector has zero in all its entries except those indicated by the positions stored in Tv; these entries contain the respective weights stored in Wv. Once Pv has been built, the list Sucv is partitioned into k sublists. Each of these sublists is assigned to a block of the grid, and the SMM associated with that block is emptied; i.e., all its cells are set to zero. When a sublist Q = {v1, v2, ..., vp} is assigned to a block inside a grid, each vertex in Q is assigned to a column of the block. In this context, assigning a vertex vi to a column means that each row of the column points to a term describing vi; in this way, the jth row points to the jth term describing vi. Figure 5 shows an example of how the list Sucv is divided by CUDA-OClus into k sublists and how these sublists are assigned to the blocks constituting the grid. The example in Figure 5 shows how the first vertex of the sublist assigned to "block 0" is assigned to the first column of that block; the other assignments can be deduced from this example.
Each row inside a column of a block has a thread that performs a set of operations. In our case, the threads associated with the ith column compute the similarity between v and its assigned vertex vi. With this aim, the thread associated with each row of the ith column computes the product between the weight the term pointed to by that row has in the description of vi, and the weight this same term has in the description of vertex v. It is important to note that although the sum in the numerator of Expression (1) runs over all the terms in T, the only products different from zero are those between terms shared by both documents; this is the reason we only use the terms of vi and multiply their weights by the weights these terms have in v; the remaining terms of v are not needed.
Given that the jth row of the column to which vertex vi has been assigned points to the jth term of Tvi, the weight this term has in vi is stored at the jth position of Wvi, and the weight this same term has in v is stored in Pv, at the entry referred to by the value stored at the jth position of Tvi. The result of the product between the above-mentioned weights is added to the value the jth row already has in the SMM. If the description of a vertex vi assigned to a column of a block exceeds the length of the column (i.e., t), tiling is applied at this block. Tiling [11] is a technique that consists of dividing a data set into a number of small subsets, such that each subset fits into the block; i.e., the SMM. Thus, when the rows of a column point at the next t terms, the products between the weights these terms have in the descriptions of vi and v are computed and accumulated into the values these rows have in the SMM. This technique is applied until all the terms describing the vertices assigned to the columns have been processed. Figure 6 shows how the similarity between v and the vertex vi assigned to the first column of "Block 0" is computed. In this example, it is assumed that there are 15 terms describing the documents of the collection, the dimension of the block is t = 5, and Tv = {1, 2, 5, 8, 10, 12, 14}, Wv = {0.2, 0.6, 0.3, 0.7, 0.2, 0.1, 0.5}, Tvi = {2, 3, 5, 8, 10, 11, 12, 14} and Wvi = {0.5, 0.3, 0.4, 0.8, 0.2, 0.6, 0.3, 0.3}.
As can be seen from Figure 6(a), each thread of the t rows of the first column computes the product between the weight of the term it points at and the weight this same term has in Pv (i.e., the description of v). As mentioned before, the computed products are stored in the SMM of that block. Note from Figure 6(a) that the product computed by the second row is zero, since vertex v does not contain the term pointed to by this row; i.e., the 3rd term in T. Figure 6(b) shows how, when tiling is applied, the remaining terms describing vi are pointed to by the rows of the first column. Figure 6(c) shows how the products between the remaining terms of vi and v are performed. Finally, Figure 7 shows the values stored in the first column of the SMM of "Block 0" once all the products have been computed.
Once all the terms describing the vertices assigned to a block have been processed, a reduction is applied over each column of the block. Reduction [12] is an operation that computes in parallel the sum of all the values of a column of the SMM and then stores this sum in the first row of the column. Figure 8 shows the final sum obtained for the first column of "Block 0".
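The per-column work described above (tiled products into an SMM column, followed by a reduction) can be simulated sequentially. This is not the CUDA kernel but a hypothetical Python sketch of what one block column computes; the name column_numerator is ours:

```python
def column_numerator(Tv, Wv, Tvi, Wvi, t):
    """Simulate one SMM column: accumulate partial products over tiles of
    size t, then reduce the column to the numerator of the cosine measure."""
    Pv = {term: w for term, w in zip(Tv, Wv)}      # the dense lookup vector P_v
    smm_col = [0.0] * t                            # one column of the SMM, emptied
    for tile_start in range(0, len(Tvi), t):       # tiling over vi's terms
        for row in range(t):
            j = tile_start + row
            if j < len(Tvi):                       # row points at the jth term of vi
                smm_col[row] += Wvi[j] * Pv.get(Tvi[j], 0.0)
    return sum(smm_col)                            # reduction over the column
```

Running it on the example vectors given in the text (t = 5) yields the numerator of the cosine measure between v and vi.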
The sum obtained in the column to which vertex vi has been assigned corresponds to the numerator of the cosine measure between v and vi. This sum is then divided by the product of the norms of v and vi, which have been previously computed; the result of this division (i.e., the similarity between v and vi) is copied to the CPU. Using this result, CUDA-OClus decides whether or not to create an edge between v and vi; during this step, CUDA-OClus also updates the values of AIS(v) and AIS(vi).
The pseudocode of the cosine similarity function is shown in Algorithm 1.
Once the thresholded similarity graph Gβ has been built, CUDA-OClus speeds up the other time-consuming step: the computation of the relevance of the vertices. To do so, CUDA-OClus computes in parallel the relevance of all the vertices of Gβ. Moreover, for each vertex v, CUDA-OClus computes in parallel the contribution each adjacent vertex of v makes to the relevance of v, speeding up the computation of the relevance of v even more. In order to accomplish this, the list of vertices of Gβ is partitioned into k sublists and each sublist is assigned to a block inside a grid. However, in this case, when a vertex vi of a sublist is assigned to a column of a block, each row in that column points to an adjacent vertex of vi; e.g., the jth row points at the jth adjacent vertex of vi. Unlike the construction of the graph Gβ, now the threads associated with a column compute the relevance of the column's assigned vertex. With this aim, the thread on each row of that column computes the contribution the vertex pointed to by that row makes to the relevance of the vertex assigned to the column.
Let v be a vertex assigned to a column and u one of its adjacent vertices. Vertex u contributes 1/|v.Adj| to the relevance of v if |v.Adj| ≥ |u.Adj|; otherwise, its contribution is zero. This case represents the contribution u makes to the relevance of v through the relative density of v. On the other hand, u contributes 1/|v.Adj| to the relevance of v if AIS(v) ≥ AIS(u); otherwise, its contribution is zero. This other case represents the contribution u makes to the relevance of v through the relative compactness of v. The total contribution provided by a vertex is added to the value the row already has in the SMM; as in the construction of the graph Gβ, the SMM of each block is initially emptied. If v has more than t adjacent vertices, tiling is applied. Once all the adjacent vertices of v have been processed, a reduction is applied in order to compute the relevance of v. The obtained values are then copied to the CPU.
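The per-vertex computation described above can be sketched sequentially as follows (an illustrative reconstruction, not the kernel itself; the name relevance and the AIS lookup table are ours). Each adjacent vertex u contributes 1/|v.Adj| to the relative density and/or relative compactness of v, and the relevance averages the two:

```python
def relevance(v, adj, AIS):
    """Relevance of v as the average of its relative density and
    relative compactness, built from per-neighbor contributions."""
    deg = len(adj[v])
    density = sum(1 for u in adj[v] if deg >= len(adj[u])) / deg
    compactness = sum(1 for u in adj[v] if AIS[v] >= AIS[u]) / deg
    return (density + compactness) / 2.0
```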
The pseudocode of the relevance computation is shown in Algorithm 2.
As mentioned before, the remaining steps of OClustR were not implemented in CUDA because they are high memory-consuming tasks that are better suited to a CPU implementation.
3.2 CUDA-DClus algorithm
In order to update an already clustered collection when changes occur (in our case, additions), DClustR first detects, in the graph Gβ representing the collection, which connected components were affected by the changes, and then it updates the covers of those components and, consequently, the overall clustering of the collection.
As stated in Section 2.2, the connected components affected by additions are those containing at least one added vertex. Thus, each time vertices are added to Gβ, in addition to computing the similarity between these vertices and those already belonging to Gβ in order to create the respective edges, DClustR also needs to build each affected connected component from scratch in order to update their covers. In order to reduce the amount of information DClustR needs to store in memory, CUDA-DClus represents the graph Gβ using an array of partial connected components, named ArrPCC, and two parallel arrays. The first of these parallel arrays, named V, contains the vertices in the order in which they were added to Gβ. The second array, named PCV, contains the index of the partial connected component to which each vertex belongs. This new representation allows CUDA-DClus to avoid rebuilding the affected components each time the collection changes, keeping the affected components updated each time vertices are added to the graph Gβ, at no extra cost.
Let Gβ = (V, Eβ, S) be the thresholded similarity graph representing the collection of documents. A partial connected component (PCC) in Gβ is a connected subgraph induced by a subset of vertices of Gβ. A partial connected component is represented using two arrays: one array containing the indexes the vertices belonging to that component have in Gβ, and the other containing the adjacency lists of the aforementioned vertices.
The array of partial connected components representing Gβ is built once, while Gβ is being constructed. The strategy used by CUDA-DClus for this purpose is as follows. In the first step, CUDA-DClus adds a vertex to V for each document of the collection and then PCV is emptied (i.e., it is filled with -1), meaning that the vertices do not belong to any PCC yet. In the second step, CUDA-DClus processes each vertex vi ∈ V. If vi does not belong to a PCC yet, CUDA-DClus creates a new PCC and puts vi in this component; when a vertex v is added to a PCC, the index this PCC has in the array ArrPCC is stored in the array PCV, at the entry referred to by the index v has in array V; this is how CUDA-DClus indicates that v now belongs to a PCC. Next, CUDA-DClus computes the similarity between vi and the vertices in Sucvi, using the strategy proposed by CUDA-OClus. Once these similarities have been computed, CUDA-DClus visits each vertex u ∈ Sucvi. If S(vi, u) ≥ β and u does not belong to any PCC yet, then u is inserted into the PCC to which vi belongs and the adjacency lists of both u and vi are modified in order to indicate they are similar to each other; otherwise, if u already belongs to a PCC, only the adjacency lists of both vertices are modified. In this last case, if the partial connected components to which vi and u belong are not the same, we will say that these partial connected components are linked.
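The bookkeeping described above can be sketched as follows. This is an illustrative sequential reconstruction (the name build_pccs and the precomputed similarity table sims are ours); the detection of linked PCCs is omitted for brevity:

```python
def build_pccs(n, sims, beta):
    """Build PCV (vertex -> PCC index) and ArrPCC (one adjacency-list
    dict per partial connected component) from pairwise similarities."""
    PCV = [-1] * n                          # -1 means: vertex in no PCC yet
    ArrPCC = []                             # each PCC: {vertex index: adjacency list}
    for v in range(n):
        if PCV[v] == -1:                    # open a new PCC for v
            PCV[v] = len(ArrPCC)
            ArrPCC.append({v: []})
        for u in range(v + 1, n):           # Suc_v: vertices following v
            if sims[v][u] >= beta:
                if PCV[u] == -1:            # u joins v's PCC
                    PCV[u] = PCV[v]
                    ArrPCC[PCV[u]][u] = []
                ArrPCC[PCV[v]][v].append(u) # update both adjacency lists
                ArrPCC[PCV[u]][u].append(v)
    return PCV, ArrPCC
```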
As an example, let Gβ be initially empty and D = {d1, d2, ..., d9} be the set of documents that will be added to the collection. For the sake of simplicity, we assume that CUDA-DClus has already added the documents to Gβ as vertices and that the similarities between each pair of documents are those shown in Table 1. Taking into account this information, Figures 9, 10, 11 and 12 exemplify how CUDA-DClus builds the array of partial connected components representing graph Gβ, for β = 0.3.
As can be seen in Figure 9, CUDA-DClus first processes vertex v1 in order to build the first PCC. As a result of this process, vertices v3 and v5 are added to the first PCC, which is now constituted by vertices v1, v3 and v5. The second PCC is built when vertex v2 is processed, see Figure 10; this component is finally constituted by vertices v2, v4 and v7. Afterwards, as can be seen in Figure 11, when vertex v3 is processed, CUDA-DClus updates the first PCC by adding vertex v6 and updating the adjacency lists of vertices v3 and v5; CUDA-DClus also updates the second PCC by modifying the adjacency list of vertex v4, which is similar to vertex v3. In this example, these two partial connected components were joined by a dashed line to illustrate the fact that they are linked, since vertices v3, belonging to the first PCC, and v4, belonging to the second PCC, are similar. Finally, the third PCC is created when CUDA-DClus processes vertex v8, as can be seen in Figure 12. The processing of vertices v4, v5, v6, v7 and v9 does not affect the partial connected components built so far; therefore, it was not included in the example.
We would like to emphasize two facts about the above process. The first fact is that, since this is the first time the array ArrPCC representing Gβ is built, all these components are already in system memory. The second fact is that if we put a PCC Pi ∈ ArrPCC into a set QPi and then iteratively add to QPi all the linked PCCs of each PCC belonging to QPi, the resulting set is a connected component. The proof is straightforward by construction. Hereinafter, we will say that QPi is the connected component induced by PCC Pi.
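The iterative closure that yields QPi is a plain graph traversal over the "linked" relation between PCCs. A minimal sketch, with `linked` an assumed mapping from a PCC index to the indexes of the PCCs linked with it:

```python
# Sketch: the connected component QPi induced by a PCC is the closure
# of the "linked" relation, computed here by a simple traversal.

def induced_component(start, linked):
    component, frontier = {start}, [start]
    while frontier:                     # iteratively add linked PCCs
        p = frontier.pop()
        for q in linked.get(p, []):
            if q not in component:
                component.add(q)
                frontier.append(q)
    return component
```

In the running example, the first and second PCCs are linked through the edge between v3 and v4, so the component induced by either of them contains both.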
Once the array ArrPCC representing Gβ has been built, CUDA-DClus processes each of its partial connected components in order to build the clustering. For processing a PCC Pi ∈ ArrPCC that has not been processed in a previous iteration, CUDA-DClus first builds QPi and then recomputes the relevance of the vertices belonging to this component using the strategy proposed by CUDA-OClus. Once the relevance of the vertices has been recomputed, CUDA-DClus follows the same steps used by DClustR for updating the covering and, consequently, the clustering of QPi. The remaining steps of DClustR were not implemented in CUDA because they are better suited to a CPU implementation. Once the clustering has been updated, CUDA-DClus stores the existing partial connected components on the hard drive, thereby releasing system memory.
Once Gβ changes due to additions of documents to the collection, CUDA-DClus updates the array ArrPCC representing Gβ and then updates the current clustering. In order to update the array ArrPCC, CUDA-DClus adds, for each incoming document, a vertex to Gβ and then sets to -1 the entries that these vertices occupy in PCV, in order to express that they do not belong to any PCC yet. Let M = {v1, v2, ..., vk} be the set of added vertices. Afterwards, for processing a vertex vi ∈ M, CUDA-DClus slightly modifies the strategy it uses for creating the partial connected components. Now, rather than computing the similarity of vi only with the vertices that come after vi in V (i.e., Suc_vi), CUDA-DClus also computes the similarity of vi with respect to the vertices that belonged to Gβ before the changes; that is, the similarities are now computed between vi and each vertex in Suc_vi ∪ (V \ M). The remaining steps are the same.
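The set of vertices an added vertex must be compared against can be written down directly. In this illustrative sketch, the pre-existing vertices occupy positions 0..n_old-1 of V and the added vertices M occupy the positions that follow (the function name and index convention are our own):

```python
# Sketch: vertices whose similarity with an added vertex vi must be
# computed when Gbeta is updated, i.e. Suc_vi together with V \ M.

def vertices_to_compare(vi, n_total, n_old):
    suc = set(range(vi + 1, n_total))   # Suc_vi: vertices after vi in V
    old = set(range(n_old))             # V \ M: pre-existing vertices
    return suc | old
```

Note that added vertices placed *before* vi in V are excluded: their similarity with vi was already computed when they themselves were processed, which avoids redundant work.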
Let D1 = {d10, d11, ..., d15} be the set of documents added to the collection represented by graph Gβ, whose array of partial connected components was built in Figure 9, and let v10, v11, ..., v15 be the vertices consequently added to Gβ by CUDA-DClus. For the sake of simplicity, in this example it is assumed that none of the vertices belonging to Gβ before the changes is similar to the added vertices, with the only exception of v2, whose similarity with v10 is 0.5. Table 2 shows the similarities between each pair of added vertices. Figures 13, 14, 15 and 16 show, assuming β = 0.3, how CUDA-DClus updates the array of partial connected components representing Gβ after the above mentioned additions. In these figures, vertices filled with light gray are those that were added to the collection.
As can be seen in Figure 13, CUDA-DClus first processes vertex v10 and, as a result, another PCC is created to contain vertices v10, v11, v12 and v13. This new PCC was joined with the PCC determined by vertex v2, through a dashed line, to reflect the fact that they are linked, since vertices v2 and v10 are similar. Furthermore, as can be seen in Figures 14 and 15, this fourth PCC is updated when vertices v11 and v12 are processed, in order to reflect the fact that they are similar to vertex v13. Finally, a fifth PCC is created when vertex v14 is processed, see Figure 16; this PCC contains vertices v14 and v15.
Once the array ArrPCC has been updated, CUDA-DClus processes each new PCC following the same strategy described above, in order to update the current clustering. It is important to highlight that, unlike when ArrPCC was created, this time the partial connected components loaded into system memory are those belonging to the connected components determined by each newly created PCC; the other partial connected components remain on the hard drive. Although in the worst scenario an incoming document can be similar to all existing documents in the collection, similarity graphs are generally very sparse, so it is expected that the new representation proposed by CUDA-DClus, as well as the strategy it uses for updating the array of partial connected components, helps CUDA-DClus to save system memory.
4 Experimental results
In this section, the results of several experiments done in order to show the performance of the CUDA-OClus and CUDA-DClus algorithms are presented. The experiments were conducted over eight document collections and were focused on: (1) assessing the correctness of the proposed parallel algorithms with respect to their original non-parallel versions, (2) evaluating the improvement achieved by the proposed algorithms with respect to the original OClustR and DClustR algorithms, and (3) evaluating the memory both CUDA-DClus and DClustR consume when processing the same collection. All the algorithms were implemented in C++; the code of the OClustR and DClustR algorithms was obtained from their authors. For implementing CUDA-OClus and CUDA-DClus, the CUDA Toolkit 5.5 was used. All the experiments were performed on a PC with a Core i7-4770 processor at 3.40 GHz and 8 GB RAM, having a PCI Express NVIDIA GeForce GT 635 with 1 GB DRAM.
The document collections used in our experiments were built from two benchmark text collections commonly used in document clustering: Reuters-v2 and TDT2. The Reuters-v2 benchmark can be obtained from http://kdd.ics.uci.edu, while the TDT2 benchmark can be obtained from http://www.nist.gov/speech/tests/tdt.html. From these benchmarks, eight document collections were built. The characteristics of these collections are shown in Table 3. As can be seen from Table 3, these collections are heterogeneous in terms of their size, dimension and the average size of the documents they contain.
In our experiments, documents were represented using the Vector Space Model (VSM) [13]. The index terms of the documents are the lemmas of the words occurring at least once in the collection; these lemmas were extracted from the documents using Tree-tagger1. Stop words such as articles, prepositions and adverbs were removed. The index terms of each document were statistically weighted using their term frequency. Finally, the cosine measure was used to compute the similarity between two documents [9].
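The representation and similarity just described follow standard formulas; a minimal sketch (not code from the paper) is:

```python
# Sketch of the document representation used in the experiments: term
# frequency vectors over lemmas, compared with the cosine measure.
import math
from collections import Counter

def tf_vector(lemmas):
    # index term -> term frequency
    return Counter(lemmas)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

These cosine values are what the thresholded similarity graph Gβ is built from: an edge exists between two documents exactly when their cosine similarity exceeds β.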
4.1 Correctness evaluation
As mentioned before, the first experiment was focused on assessing the correctness of the proposed algorithms. With this aim, we compare the clusterings built by CUDA-OClus and CUDA-DClus with respect to those built by the original OClustR and DClustR algorithms, under the same conditions. For evaluating CUDA-OClus we selected the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections, whilst for evaluating CUDA-DClus we selected the Reu-10K, Reu-20K and Reu-30K collections. These collections were selected because they resemble the collections over which OClustR and DClustR were evaluated in [6] and [7], respectively.
In order to evaluate CUDA-OClus, we executed OClustR and CUDA-OClus over the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections, using β = 0.25 and 0.35. We used these threshold values because they obtained the best results in several collections, as reported in the original OClustR [6] and DClustR [7] articles. Then, we took the clustering results obtained by OClustR as ground truth and evaluated the clustering results obtained by CUDA-OClus in terms of their accuracy, using the FBcubed [14] and the Normalized Mutual Information (NMI) [15] external evaluation measures.
FBcubed is one of the external evaluation measures most used for evaluating overlapping clustering algorithms and, unlike other external evaluation metrics, it satisfies the four fundamental constraints proposed in [14] (cluster homogeneity, cluster completeness, rag bag, and cluster size vs. quantity). On the other hand, NMI is a similarity measure borrowed from information theory, which has proved to be reliable [15]. Both metrics take values in [0, 1], where 1 means identical results and 0 completely different results. In order to take into account the inherent data order dependency of CUDA-OClus, we executed CUDA-OClus twenty more times over the above mentioned collections, for each parameter value, varying the order of their documents. Table 4 shows the average FBcubed and NMI values attained by CUDA-OClus for each selected collection, using β = 0.25 and 0.35.
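To make the BCubed family concrete: FBcubed [14] extends BCubed to overlapping clusterings via a multiplicity term, which we do not reproduce here. The sketch below shows only the plain BCubed F-measure for hard (non-overlapping) clusterings, where `pred` and `gold` map each item to a single cluster label; it conveys the item-centred precision/recall idea but is a simplification of the measure actually used in the paper.

```python
# Simplified BCubed F-measure for hard clusterings (illustrative only;
# the FBcubed of [14] additionally handles overlap via multiplicity).

def _bcubed(a, b):
    # Average, over items e, of the fraction of items sharing e's
    # cluster in `a` that also share its cluster in `b`.
    total = 0.0
    for e in a:
        same = [e2 for e2 in a if a[e2] == a[e]]
        correct = sum(1 for e2 in same if b[e2] == b[e])
        total += correct / len(same)
    return total / len(a)

def bcubed_f(pred, gold):
    p = _bcubed(pred, gold)   # BCubed precision
    r = _bcubed(gold, pred)   # BCubed recall
    return 2 * p * r / (p + r)
```

Identical clusterings score 1.0, matching the interpretation of the metric given above.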
As can be seen from Table 4, the average FBcubed and NMI values attained by CUDA-OClus are very close to 1, meaning that the clusters CUDA-OClus builds are almost identical to those built by OClustR. The differences between the clusterings are caused by the inherent data order dependency of the algorithms and by the different floating-point arithmetic used by CUDA.
In order to assess the validity of CUDA-DClus, in the second part of the first experiment we compare the clustering results built by CUDA-DClus with respect to those obtained by DClustR. With this aim, we obtain a ground truth by executing DClustR over the Reu-30K collection, also using β = 0.25 and β = 0.35, and then we process the Reu-20K and Reu-10K collections, in this order, as if they were additions of documents to the collection. That is, we add the documents contained in Reu-20K to the current collection (i.e., Reu-30K) and update the clustering using DClustR, and after that we add Reu-10K to the collection resulting from the previous additions (i.e., Reu-30K ∪ Reu-20K) and update the clustering again. We repeated the above execution under the same parameter configuration but using CUDA-DClus instead of DClustR. Then, we take the results obtained by DClustR as ground truth and evaluate each of the clustering results obtained by CUDA-DClus in terms of their accuracy, using the FBcubed and NMI external evaluation measures. As in the first part of this experiment, we executed CUDA-DClus twenty times under the above experimental configuration, each time varying the order of the documents inside the collections. Table 5 shows the average FBcubed and NMI values attained by CUDA-DClus for each selected collection, using β = 0.25 and 0.35.
As can be seen from Table 5, the average FBcubed and NMI values attained by CUDA-DClus are very close to 1, meaning that the clusters it builds are almost identical to those built by DClustR. From this first experiment, we can conclude that the speed-up attained by CUDA-OClus and CUDA-DClus does not degrade their accuracy with respect to the original non-parallel versions.
4.2 Execution time evaluation
In the second experiment, we evaluated the time improvement achieved by CUDA-OClus and CUDA-DClus with respect to OClustR and DClustR, respectively. With this aim, we executed both OClustR and CUDA-OClus over Reu-10K, Reu-20K, Reu-30K, Reu-40K, Reu-50K, Reu-60K and Reu-70K, using β = 0.25 and 0.35, and we measured the time they spent. As in the previous experiment, in order to take into account the data order dependency of both algorithms, we repeated the above executions twenty times, for each collection and each parameter configuration, varying the order of the documents of the collections. Figure 17 shows the average time both OClustR and CUDA-OClus spent clustering each selected collection, for each parameter configuration.
As can be seen from Figure 17, CUDA-OClus is faster than OClustR over each selected dataset and for both values of β; for β = 0.25 and β = 0.35, CUDA-OClus is respectively 1.26x and 1.29x faster than OClustR. It is important to note from Figure 17 that, as the size of the processed collection grows, the difference in the time spent by each algorithm also grows; this behavior shows how well CUDA-OClus scales as the size of the collection grows. We would like to highlight that the specifications of the computer used in the experiments favored the CPU-based algorithms over the GPU-based ones, since a Core i7-4770 processor at 3.40 GHz with 8 GB RAM is superior to a PCI Express NVIDIA GeForce GT 635 with 1 GB DRAM, which has only two streaming multiprocessors and limited memory. Hence, taking into account the execution model of a GPU, in which the grid blocks are numbered and distributed among all streaming multiprocessors, each of which simultaneously executes one task over a specific block, we expect that with a more powerful GPU having more streaming multiprocessors, the difference between the processing times of the parallel and sequential versions would be even higher than the one shown in these experiments.
In order to compare DClustR and CUDA-DClus, in the second part of the second experiment we clustered the Reu-50K collection using both algorithms and then measured the time each algorithm spent updating the current clustering each time N documents of the Tdt-65K collection were incrementally added to the existing collection. In this experiment we also used β = 0.25 and 0.35, and we set N = 5000 and N = 10000, which are much greater values than those used to evaluate DClustR [7]. In order to take into account the data order dependency of both algorithms, the above executions were also repeated twenty times, for each collection and each parameter configuration, varying the order of the documents of the collections. Figure 18 shows the average time both DClustR and CUDA-DClus spent updating the current clustering, for each parameter configuration.
As can be seen from Figure 18, CUDA-DClus behaves better than DClustR, for each parameter configuration, when multiple additions are processed over the selected dataset, showing an average speed-up of 1.25x and 1.29x for β = 0.25, N = 5000 and β = 0.35, N = 5000, respectively. Moreover, it also showed an average speed-up of 1.19x and 1.26x for β = 0.25, N = 10000 and β = 0.35, N = 10000, respectively. As in the first part of this experiment, it can also be seen from Figure 18 that the behavior of CUDA-DClus, with respect to that of DClustR, becomes better as the size of the collection grows; thus, we can say that CUDA-DClus also scales well as the size of the collection grows.
4.3 Memory use evaluation
Although the spatial complexity of both algorithms is O(|V| + |Eβ|), the strategy CUDA-DClus proposes for representing Gβ should reduce the amount of memory needed to update the clustering each time the collection changes. Thus, in the third experiment, we compared the amount of memory used by CUDA-DClus against that used by DClustR when processing the changes performed in the second experiment. The number of connected components loaded by both algorithms while updating the current clustering after changes is directly proportional to the memory used. Based on this, Figure 19 shows the average number of connected components (i.e., Ave. NCC) each algorithm loads into system memory when processing the changes presented in Figure 18, for each parameter configuration.
As can be seen from Figure 19, CUDA-DClus consumes less memory than DClustR for each parameter configuration; the memory usage of CUDA-DClus is 22.43% lower than that of DClustR for β = 0.25 and 42.46% lower for β = 0.35. This characteristic, plus the fact that CUDA-DClus is also faster than DClustR, makes CUDA-DClus suitable for applications processing large document collections.
We would like to highlight that, in the worst scenario, if the clustering of all the connected components needs to be updated, all the partial connected components will be loaded into system memory and thus our proposed CUDA-DClus and DClustR will have a similar behavior. Additionally, taking into account the results of the experiments in Sections 4.1 and 4.2, we can conclude that the strategy proposed for reducing the memory used by CUDA-DClus does not introduce any considerable cost in the overall processing time of CUDA-DClus or in its accuracy.
5 Conclusions
In this paper, we introduced two GPU-based parallel versions of the OClustR and DClustR clustering algorithms, named CUDA-OClus and CUDA-DClus, specifically tailored for document clustering. CUDA-OClus proposes a strategy to speed up the most time-consuming steps of OClustR. This strategy is reused by CUDA-DClus to speed up the most time-consuming steps of DClustR. Moreover, CUDA-DClus proposes a new strategy for representing the graph Gβ that DClustR uses for representing the collection of documents. This new representation allows CUDA-DClus to reduce the amount of memory it needs and also helps it to avoid rebuilding the affected components each time the collection changes, while still keeping them updated after each change, at no extra cost.
The proposed parallel algorithms were compared against their original versions over several standard document collections. The experiments were focused on: (a) assessing the correctness of the proposed parallel algorithms, (b) evaluating the speed-up achieved by CUDA-OClus and CUDA-DClus with respect to OClustR and DClustR, respectively, and (c) evaluating the memory both CUDA-DClus and DClustR consume when processing changes. From the experiments, it can be seen that both CUDA-OClus and CUDA-DClus are faster than OClustR and DClustR, respectively, and that the speed-up these parallel versions attain does not degrade their accuracy. The experiments also showed that CUDA-DClus consumes less memory than DClustR when both algorithms are processing changes over the same collection.
Based on the obtained results, we can conclude that both CUDA-OClus and CUDA-DClus enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. These parallel algorithms could be useful in applications such as news analysis, information organization and profile identification, among others. We would like to mention that, even though the proposed parallel algorithms were specifically tailored for processing documents using the cosine measure, the strategy they propose can be easily extended to work with other similarity or distance measures such as the Euclidean and Manhattan distances.
As future work, we are going to explore the use in CUDA-OClus and CUDA-DClus of other types of GPU memory, such as texture memory, which is faster than the memory both CUDA-OClus and CUDA-DClus currently use. Besides, we are going to evaluate both algorithms on faster GPU cards, in order to gain a better insight into the performance of both algorithms when the number of CUDA cores is increased.
1http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
References
[1] E. Bae, J. Bailey, G. Dong, A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings, Data Mining and Knowledge Discovery 21 (3) (2010) 427-471.
[2] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264-323.
[3] S. Gregory, A fast algorithm to find overlapping communities in networks, in: Machine learning and knowledge discovery in databases, Springer, 2008, pp. 408-423.
[4] J. Aslam, K. Pelekhov, D. Rus, Static and dynamic information organization with star clusters, in: Proceedings of the seventh international conference on Information and knowledge management, ACM, 1998, pp. 208-217.
[5] A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz-Shulcloper, J. M. Pérez-Martínez, Jerartop: A new topic detection system, in: Progress in Pattern Recognition, Image Analysis and Applications, Springer, 2004, pp. 446-453.
[6] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, Oclustr: A new graph-based algorithm for overlapping clustering, Neurocomputing 121 (2013) 234-247.
[7] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, An algorithm based on density and compactness for dynamic overlapping clustering, Pattern Recognition 46 (11) (2013) 3040-3055.
[8] L. J. G. Soler, A. P. Suárez, L. Chang, Efficient overlapping document clustering using gpus and multicore systems, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - 19th Iberoamerican Congress, CIARP 2014, Puerto Vallarta, Mexico, November 2-5, 2014. Proceedings, 2014, pp. 264-271.
[9] M. W. Berry, M. Castellanos, Survey of text mining, Computing Reviews 45 (9) (2004) 548.
[10] R. Gil-García, A. Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recognition Letters 31 (6) (2010) 469-477.
[11] J. Sanders, E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming, Addison-Wesley Professional, 2010.
[12] D. B. Kirk, W. H. Wen-mei, Programming massively parallel processors: a hands-on approach, Newnes, 2012.
[13] G. Salton, A. Wong, C.-S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (11) (1975) 613-620.
[14] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information retrieval 12 (4) (2009) 461-486.
[15] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New Journal of Physics 11 (3) (2009) 033015.
© 2018. This work is published under the Creative Commons Attribution 3.0 License (https://creativecommons.org/licenses/by/3.0/).