Informatica 42 (2018) 229–244 229 Static and Incremental Overlapping Clustering Algorithms for Large Collections Processing in GPU Lázaro Janier González-Soler, Airel Pérez-Suárez and Leonardo Chang Advanced Technologies Application Center (CENATA V) 7ma A # 21406, Playa, CP: 12200, Havana, Cuba E-mail: jsoler@ceneatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/jsoler E-mail: asuarez@cenatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/asuarez E-mail: lchang@cenatav.co.cu and http://www.cenatav.co.cu/index.php/profile/profile/userprofile/lchang Keywords: data mining, clustering, overlapping clustering, GPU computing Received: November 18, 2016 Pattern Recognition and Data Mining pose several problems in which, by their inherent nature, it is con- sidered that an object can belong to more than one class; that is, clusters can overlap each other. OClustR and DClustR are overlapping clustering algorithms that have shown, in the task of documents clustering, the better tradeoff between quality of the clusters and efficiency, among the existing overlapping clustering algorithms. Despite the good achievements attained by both aforementioned algorithms, they areO(n 2 ) so they could be less useful in applications dealing with a large number of documents. Moreover, although DClustR can efficiently process changes in an already clustered collection, the amount of memory it uses could make it not suitable for applications dealing with very large document collections. In this paper, two GPU-based parallel algorithms, named CUDA-OClus and CUDA-DClus, are proposed in order to enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. The experimental evaluation conducted over several standard document collections showed the correctness of both CUDA-OClus and CUDA-DClus, and also their better performance in terms of efficiency and memory consumption. Povzetek: OClustR in DClustR sta prekrivna algoritma za gruˇ cenje, ki dosegata dobre rezultate, vendar je njuna kompleksnost kvadratnega reda velikosti. V tem prispevku sta predstavljena dva paralelna algo- ritma, ki temeljita na GPU: CUDA-OClus in CUDA-DClus. V eksperimentih sta pokazala zmožnost dela z velikimi koliˇ cinami podatkov. 1 Introduction Clustering is a technique of Machine Learning and Data Mining that has been widely used in several contexts [1]. This technique aims to structure a data set in clusters or classes such that objects belonging to the same class are more similar than objects belonging to different classes [2]. There are several problems that, by their inherent nature, consider that objects could belong to more than one class [3, 4, 5]; that is, clusters can overlap each other. Most of the clustering algorithms developed so far do not consider that clusters could share elements; however, the desire of ade- quately target those applications dealing with this problem, have recently favored the development of overlapping clus- tering algorithms; i.e., algorithms that allow objects to be- long to more than one cluster. An overlapping clustering algorithm that has shown, in the task of documents clus- tering, the better tradeoff between quality of the clusters and efficiency, among the existing overlapping clustering algorithms, is OClustR [6]. Despite the good achievements attained by OClustR in the task of documents clustering, it has two main limitations: 1. It has a computational complexity of O(n 2 ), so it could be less useful in applications dealing with a large amount of documents. 2. It assumes that the entire collection is available be- fore clustering. Thus, when this collection changes it needs to rebuild the clusters starting from scratch; that is, OClustR does not use the previously built cluste- ring for updating the clusters after changes. In order to overcome the second limitation, the DClustR algorithm was proposed by Pérez-Suárez et al. in [7]. DClustR introduced a strategy for efficiently updating the clustering after multiple additions and/or deletions from the collection, making it suitable for handling overlapping clustering in applications where the collection changes fre- quently, specially for those applications handling multi- ple changes at the same time. Nevertheless, DClustR still suffers from the first limitation; that is, like OClustR, it is O(n 2 ). This implies that when the collection grows a lot, the time that DClustR uses for processing the chan- ges could make it less useful in real applications. Moreo- ver, when the collection grows, the memory space used by 230 Informatica 42 (2018) 229–244 L.J. González-Soler et al. DClustR for storing the data it needs will also grow, ma- king DClustR a high memory consumer and consequently, making it not suitable for applications dealing with large collections. Motivated by the above mentioned facts, in this work we extend both OClustR and DClustR for efficiently processing very large document collections. A technique that has been widely used in recent years in order to speed-up computing tasks is parallel computing and specifically, GPU computing. A GPU is a device that was initially designed for processing algorithms belonging to the graphical world, but due to its low cost, its high level of parallelism and its optimized floating-point operations, it has been used in many real applications dealing with a large amount of data. The main contribution of this paper is the proposal of two GPU-based parallel algorithms, namely CUDA- OClus and CUDA-DClus, which enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents, like for instance news analysis, information organization and profiles iden- tification, among others. Preliminary results of this paper were published in [8]. The main differences of this paper with respect to the conference paper presented in [8] are the following: (1) we introduce a new GPU-based algorithm, named CUDA- DClus, which is a parallel version of the DClustR algo- rithm, that is able to efficiently process changes in an alre- ady clustered collection and to efficiently process large col- lections of documents, and (2) we introduce a strategy for incrementally building and updating the connected compo- nents presented in a graph, allowing CUDA-DClus to mi- nimize the memory needed for processing the whole col- lection. It is important to highlight that in CUDA-DClus we only analyze the additions of objects to the collection, be- cause this is the case in which it could be difficult to apply DClustR in real applications dealing with large collections, since this is the case that makes the collection grow. The remainder of this paper is organized as follows: in Section 2, a brief description of both the OClustR and DClustR algorithms are presented. In Section 3, the CUDA-OClus and CUDA-DClus parallel clustering algo- rithms are proposed. An experimental evaluation, showing the performance of both proposed algorithms on several document collections, is presented in Section 4. Finally, the conclusions as well as some ideas about future directi- ons are presented in Section 5. 2 OClustR and DClustR algorithms In this section, both the OClustR [6] and DClustR [7] al- gorithms are described. Since DClustR is the extension of OClustR for efficiently processing collections that can change due to additions, deletions and modifications, the OClustR is first introduced and then, the strategy used by DClustR for updating the clustering after changes is pre- sented. All the definitions and examples presented in this section were taken from [6, 7]. 2.1 The OClustR algorithm In order to build a set of overlapping clusters from a col- lection of objects, OClustR employs a strategy comprised of three stages. In the first stage, the collection of objects is represented by OClustR as a weighted thresholded simila- rity graph. Afterwards, in the second stage, an initial set of clusters is built through a cover of the graph representing the collection, using a special kind of subgraph. Finally, in the third stage the final set of overlapping clusters is obtai- ned by improving the initial set of clusters. Following, each stage is briefly described. Let O = fo 1 ;o 2 ;:::;o n g be a collection of objects, 2 [0;1] a similarity threshold, and S:O O!< a sym- metric similarity function. A weighted thresholded simila- rity graph, denoted as e G =hV; e E ;Si, is an undirected and weighted graph such thatV = O and there is an edge (v;u)2 e E iffS(v;u) ; each edge(v;u)2 e E ;v6=u is labeled with the value of S(v;u). As it can be infer- red, in the first stage OClustR must compute the similarity between each pair of objects; thus, the computational com- plexity of this stage isO(n 2 ). Once e G is built, in the second stage OClustR builds an initial set of clusters through a covering of e G , using weighted star-shaped sub-graphs. Let e G =hV; e E ;Si be a weighted thresholded simila- rity graph. A weighted star-shaped sub-graph (ws-graph) in e G , denoted by G ? =hV ? ;E ? ;Si, is a sub-graph of e G , having a vertexc2V ? , called the center ofG ? , such that there is an edge betweenc and all the other vertices in V ? nfcg; these vertices are called satellites. All vertices in e G having no adjacent vertices (i.e., isolated vertices) are considered degenerated ws-graphs. For building a covering of e G using ws-graphs, OClustR must build a setW =fG ? 1 ;G ? 2 ;:::;G ? k g of ws-graphs of e G , such thatV = S k i=1 V ? i , beingV ? i ;8i = 1:::k, the set of vertices of the ws-graph G ? i . For solving this pro- blem, OClustR searches for a list C = fc 1 ;c 2 ;:::;c k g, such thatc i 2 C is the center ofG ? i 2 W ,8i = 1::k. In the following, we will say that a vertex v is covered if it belongs toC or if it is adjacent to a vertex that belongs to C. For pruning the search space and for establishing a cri- terion in order to select the vertices that should be included inC, the concept of relevance of a vertex is introduced. The relevance of a vertexv, denoted asv:relevance, is defined as the average between the relative density and the relative compactness of a vertexv, denoted asv:densityR and v:compactnessR, respectively, which are defined as follows: v:densityR = jfu2v:Adj=jv:Adjjj u:Adjjgj jv:Adjj ; v:compactnessR = jfu2v:Adj=AIS(G ? v ) AIS(G ? u )gj jv:Adjj ; wherev:Adj andu:Adj are the set of adjacent vertices ofv andu, respectively;G ? v andG ? u are the ws-graphs de- termined by verticesv andu, andAIS(G ? v ) andAIS(G ? u ) Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 231 are the approximated intra-cluster similarity ofG ? v andG ? u , respectively. The approximated intra-cluster similarity of a ws-graphG ? is defined as the average weight of the edges existing inG ? between its center and its satellites. Based on the above definitions, the strategy that OClustR uses in order to build the listC is composed of three steps. First, a candidate listL containing the vertices having rele- vance greater than zero is created; isolated vertices are di- rectly included inC. Then,L is sorted in decreasing order of their relevance and each vertexv2 L is visited. Ifv is not covered yet or it has at least one adjacent vertex that is not covered yet, thenv is added toC. Each selected vertex, together with its adjacent vertices, constitutes a cluster in the initial set of clusters. The second stage of OClustR also has a computational complexity ofO(n 2 ). Figure 1 shows through an example, the steps performed by OClustR in the second stage for building the initial set of clusters. Finally, in the third stage, the final clusters are obtained though a process which aims to improve the initial clusters. With this aim, OClustR processesC in order to remove the vertices forming a non-useful ws-graph. A vertexv forms a non-useful ws-graph if: a) there is at least another ver- texu2 C such that the ws-graphu determines includesv as a satellite, andb) the ws-graph determined byv shares more vertices with other existing ws-graphs than those it only contains. For removing non useful vertices, OClustR uses three steps. First, the vertices inC are sorted in des- cending order according to their number of adjacent verti- ces. After that, each vertex v 2 C is visited in order to remove those non-useful ws-graphs determined by vertices in(v:Adj\C). If a ws-graphG ? u , withu2 (v:Adj\C), is non-useful,u is removed fromC and the satellites it only covers are “virtually linked” tov by adding them to a list namedv:Linked; in this way, those vertices virtually lin- ked to v will also belong to the ws-graph v determines. Once all vertices in (v:Adj\C) are analyzed, v together with the vertices inv:Adj andv:Linked constitute a final cluster. This third stage also has a computational complex- ity ofO(n 2 ). Figure 2 shows through an example, how the final clusters are obtained from the initial clusters showed in Figure 1(d). 2.2 Updating the clusters after changes: the DClustR algorithm Let e G = hV; e E ;Si be the weighted thresholded simi- larity graph that represents an already clustered collection O. LetC =fc 1 ;c 2 ;:::;c k g be the set of vertices repre- senting the current covering of e G and consequently, the current clustering. When some vertices are added to and/or removed from O (i.e., from e G ), there could happen the following two situations: 1) Some vertices become uncovered. This situation occurs when at least one of the added vertices is unco- vered or when those vertices ofC covering a specific vertex were deleted from e G . 0.15 0.20 0.30 0.20 0.40 0.30 0.10 0.10 0.10 0.10 0.10 0.20 0.20 0.20 0.20 0.50 0.40 0.20 0.16 0.30 0.50 0.70 0.32 0.40 0.20 0.25 0.20 (a) A weighted thresholded similarity graph e G 0 0 0.9 0.5 0.25 0.3 0.92 0.84 0 0 0 0.84 0 0.17 0.6 0 0.5 0.5 0.5 0 0 0.67 0.5 (b) e G using relevance for labeling the vertices 0 0 0.9 0.5 0.25 0.3 0.92 0.84 0 0 0.84 0 0.17 0.6 0 0.5 0.5 0.5 0 0 0.67 0 0.5 (c) Vertices belonging to setC 0 0 0.9 0.5 0.25 0.3 0.92 0.84 0 0 0.84 0 0.17 0.6 0 0.5 0.5 0.5 0 0 0.67 0 0.5 (d) Set of initial clusters Figure 1: Illustration of how OClustR builds the initial set of clusters. 232 Informatica 42 (2018) 229–244 L.J. González-Soler et al. 0 0 0.9 0.5 0.25 0.3 0.92 0.84 0 0 0.84 0 0.17 0.6 0 0.5 0.5 0.5 0 0 0.67 0 0.5 (a) Vertices determining non-useful ws-graphs (filled with light gray) 0 0 0.9 0.5 0.25 0.3 0.92 0.84 0 0 0.84 0 0.17 0.6 0 0.5 0.5 0.5 0 0 0.67 0 0.5 (b) Final set of overlapping clusters Figure 2: Illustration of how the final clusters are obtained by OClustR in the third stage. 2) The relevance of some vertices changes and, as a con- sequence, at least one vertexu = 2C appears such that u has relevance greater than at least one vertex inC that covers vertices in u:Adj[fug. Vertices like u could determine ws-graphs with more satellites and less overlapping with other ws-graphs than other ws- graphs currently belonging to the covering of e G . Figure 3, illustrates the above commented situations over the graph e G of Figure 1(a). Figure. 3(a), shows the graph e G before the changes; the vertices to be removed are mar- ked with an “x”. Figure 3(b), shows graph e G after the changes; vertices filled with light gray represent the added vertices. Figures 3(c) and 3(d), show the updated graph e G with vertices labeled with letters and with their up- dated value of relevance, respectively; vertices filled with black correspond with those vertices currently belonging to C. As it can be seen from Figures 3(c) and 3(d), vertices S;F;G;I;H and J became uncovered after the changes, while vertex B, which does not belong to C, has a rele- vance greater than vertexD, which already belongs toC. Taking into account the above mentioned situations, in order to update the clustering after changes DClustR first detects which are the connected components of e G that were affected by changes and then it iteratively updates the covering of these components and consequently, their clus- 0.15 0.20 0.30 0.20 0.40 0.30 0.10 0.10 0.10 0.10 0.10 0.20 0.20 0.20 0.20 0.50 0.40 0.20 0.16 0.30 0.50 0.70 0.32 0.40 0.20 0.25 0.20 x x x x x (a) e G before the changes 0.15 0.20 0.30 0.20 0.30 0.40 0.16 0.30 0.50 0.32 0.40 0.20 0.25 0.20 0.21 (b) e G after the changes A C B D E F G H J K M L N O P Q I S R (c) e G with vertices labeled with letters 0 0 0.88 0.67 0.5 0 0 0.75 0.5 0.67 0 0.5 0.5 0.5 0 0 0 0 0.7 (d) e G with vertices labeled with their updated value of relevance Figure 3: Illustration of how some changes in the collection affect the current covering of the graph e G of Figure 2(b). Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 233 tering. The connected components that are affected by changes are those that contain vertices that were added or verti- ces that were adjacent to vertices that were deleted from e G . Since DClustR has control over these vertices it can build these components through a depth first search, star- ting from any of these vertices. Let G 0 =hV 0 ;E 0 ;Si be a connected component affected by changes, whose cove- ring must be updated. Let C 0 C be the set of vertices of G 0 which determine ws-graphs (i.e., clusters) covering G 0 . DClustR follows the same principles of OClustR; that is, it first builds or completes the covering ofG 0 in order to build an initial set of clusters (stage 1) and then, it impro- ves these clusters in order to build the final set of clusters ofG 0 (stage 2). In fact, DClustR uses the same steps that OClustR for the above two mentioned stages, but unlike OClustR, DClustR modifies the way in which the candi- date listL, used in stage 1, is built. In order to build candidate listL, DClustR first recom- putes the relevance value of all vertices inG 0 and it empties the listc:Linked, for all verticesc2C 0 ; this last action is supported by the fact that, after changes, there could be ws- graphs that were considered as non useful, which could be no longer so. LetV + (V 0 nC 0 ) be the set of vertices of G 0 with relevance greater than zero, which do not belong toC 0 . For building the candidate listL, bothC 0 andV + are processed. For processingV + , DClustR visits each vertexv2 V + and it verifiesa) ifv is uncovered, orb) if at least one ad- jacent vertex ofv is uncovered, orc) if there is at least one vertexu2 v:Adj, such that there is no other vertex inC 0 coveringu whose relevance is greater than or equal to the relevance ofv. If any of these three conditions is fulfilled, v is added to L. Additionally, if the last condition is ful- filled, all those vertices likeu are marked as “activated” in order to use them whenC 0 is being processed. The compu- tational complexity of the processing ofV + isO(n 2 ). For processingC 0 , DClustR visits the adjacent vertices of each vertexv2C 0 . Any vertexu2v:Adj having grea- ter relevance thanv is added toL; in these cases,v is addi- tionally marked as “weak”. Once all the adjacent vertex of v were visited, ifv was marked as “weak” or at least one of its adjacent vertices were previously marked as “active”,v is removed fromC 0 since it could be substituted by a more relevant vertex. However, ifv has a relevance greater than zero, it is still considered as a candidate and consequently, it is added toL. The computational complexity of the pro- cessing ofC 0 isO(n 2 ). Figure 4, shows the updated set of overlapping clusters obtained by DClustR when it processes the graph in Fi- gure 3(d); vertices filled with black represent the vertices determining ws-graphs that cover each connected compo- nent of e G . Like OClustR, the computational complexity of DClustR isO(n 2 ). 0 0 0.88 0.67 0.5 0 0 0.75 0.5 0.67 0 0.5 0.5 0.5 0 0 0 0 0.7 Figure 4: Updated set of overlapping clusters obtained by DClustR. 3 Proposed parallel algorithms As it was mentioned in Section 1, despite the good achie- vements attained by OClustR and DClustR in the task of documents clustering, these algorithms areO(n 2 ) so they could be less useful in applications dealing with a very large number of documents. Motivated by this fact, in this section two massively parallel implementations in CUDA of OClustR and DClustR are proposed in order to enhance the efficiency of OClustR and DClustR in the above menti- oned problems. These parallel algorithms, namely CUDA- OClus and CUDA-DClus, take advantage of the benefits of GPUs, like for instance, the high bandwidth communica- tion between CPU and GPU, and the GPU memory hierar- chy. Although in their original articles both OClustR and DClustR were proposed as general purpose clustering al- gorithms, the parallel extensions proposed in this work are specifically designed for processing documents. This ap- plication context is the same in which both OClustR and DClustR were evaluated and it is also a context in which very large collections are commonly processed. In the context of document processing, both CUDA-OClus and CUDA-DClus use the cosine measure [9] for computing the similarity between two documents; this measure is the function that has been used the most for this purpose [10]. The cosine measure between two documents d i and d j is defined as: cos(d i ;d j ) = P m k=1 d i (k) d j (k) kd i kkd j k ; (1) whered i (k) andd j (k) are the weights of thek term in the description of the documentsd i andd j , respectively;kd i k andkd j k are the norms of documentsd i andd j , respecti- vely. In experiments conducted over several document col- lections, it was verified that the first stage of OClustR, the construction of the similarity graph, consumes the 99% of the processing time of the algorithm. The remaining 1% is mainly dominated by the computation of the relevance of the vertices. Based on this fact, the above two mentio- ned steps are the ones that will be implemented in CUDA 234 Informatica 42 (2018) 229–244 L.J. González-Soler et al. by CUDA-OClus; remaining steps are high memory con- suming tasks that are more favored with a CPU implemen- tation. Analogously, in these experiments it was also ve- rified that the most time consuming steps of DClustR are the updating of the graph after changes and the recompu- ting of the relevance, so these steps will be implemented in CUDA by CUDA-DClus. In this case, it could be no- ticed also that the detection of the connected components affected by changes is a high memory consuming task per- formed by DClustR, so it is also important to address this problem in CUDA-DClus. Finally, it is also important to mention that since we are dealing with the problem of processing very large do- cument collections, CUDA-DClus only tackles additions, which are the changes that could increase the size of the collection. Implementing deletions is irrelevant for overco- ming problems related with large document collections. Following, the CUDA-OClus algorithm is first introdu- ced and then, the CUDA-DClust algorithm is presented. 3.1 CUDA-OClus algorithm LetD =fd 1 ;d 2 :::;d n g be a collection of documents des- cribed by a set of terms. LetT =ft 1 ;t 2 ;:::;t m g be the list containing all the different terms that describe at least one document inD. CUDA-OClus represents a document d i 2 D by two parallel vectors, denoted byT di andW di . The first one contains the position that the terms describing d i have inT , and the second one contains the weights that those terms have in the description ofd i . For building e G =hV; e E ;Si, OClustR demandsS to be a symmetric similarity measure, so the similarity bet- ween any two documents (i.e., vertices in e G ) needs to be computed only once. Based on this fact and considering the inherent order the documents have inside a collection D (i.e., vertices inV ), for building the edges relatives to a vertexv2V it is only necessary to compute the similarity betweenv and each vertex followingv inV . LetSuc v be the list of vertices that follow a vertex v in V . To speed up the construction of e G , for each vertexv2V , CUDA- OClus will compute in parallel the similarity betweenv and the vertices inSuc v . Considering the definition of the cosine measure, it can be seen from Expression (1) that its numerator is a sum of independent products which could be computed all at once. On the other hand, taking into account that the norm of a document can be computed while the document is being read, the denominator of Expression (1) can be also resol- ved with no extra time. Based on these facts, CUDA-OClus also parallelizes the computation of the similarity between a pair of vertices, in order to speed up even more the con- struction of e G . In order to carry out the previous idea, CUDA-OClus builds a grid comprised of k square blocks, each block having a shared memory square matrix (SMM); where k = n t +1 andt is the dimension of both the blocks and the matrices. A grid is a logic representation of a matrix of threads in the GPU. The use of SMM and its low la- tency will allow CUDA-OClus to not constantly access the CPU memory, speeding up the calculus of the similarity between two vertices. CUDA-OClus assigns tot the maxi- mum value allowed by the architecture of the GPU for the dimension of a SMM. When CUDA-OClus is going to compute the similarity between a vertexv and the vertices inSuc v , it first builds a vectorP v of sizem. This vector has zero in all its entries excepting in those expressed by the positions stored inT v ; these last entries contain their respective weights stored in W v . Once P v has been built, the list Suc v is partitioned intok sublists. Each one of these sublists is assigned to a block constituting the grid and the SMM associated with that block is emptied; i.e., all its cells are set to zero. When a sublist Q =fv 1 ;v 2 ;:::;v p g is assigned to a block in- side a grid, each vertex inQ is assigned to a column of the block. In this context, to assign a vertexv i to a column me- ans that each row of the column points to a term describing v i ; in this way, the jth row points to the jth term descri- bingv i . Figure 5 shows an example of how the listSuc v is divided by CUDA-OClus intok sublists and how these su- blists are assigned to the blocks constituting the grid. The example on Figure 5 shows how the first vertex of the su- blist assigned to “block 0” is assigned to the first column of that block; the other assignments could be deduced from this example. … Block 0 Block 1 Block k Grid 2 3 5 8 10 11 12 14 v 1 v 2 v t V t+1 V t+2 V 2t V n Suc v 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 T v 2 W v 2 … … T v n W v n t=5 Figure 5: Illustration of how CUDA-OClus divides Suc v and assigns each resulting sublist to the blocks. Each row inside a column of a block has a thread that performs a set of operations. In our case, the threads asso- ciated with theith column will compute the similarity bet- weenv and its assigned vertexv i . With this aim, the thread associated with each row inside theith column will com- pute the product between the weight that the term pointed by that row has in the description ofv i , and the weight this same term has in the description of vertexv. It is important to note that although the sum in the numerator of Expres- sion (1) runs over all the terms inT , the products that will be different from zero are only those between terms shared Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 235 by both documents; this is the reason we only use the terms ofv i and multiply their weights by the weights that these terms have inv; remaining terms inv are useless. Given that thejth row of the column to which vertexv i has been assigned, points to thejth term ofT vi , the weight this term has inv i is stored at thejth position ofW vi and the weight this same term has in v is stored at P v , in the entry referred by the value stored at thejth position ofT vi . The result of the product between the above mentioned weights is added to the value the jth row already has in the SMM. If the description of a vertex v i assigned to a column of a block exceeds the length of the column (i.e., t) a tiling is applied at this block. Tiling [11] is a technique that consists on dividing a data set into a number of small subsets, such that each subset fits into the block; i.e., the SMM. Thus, when the rows of a column point at the nextt terms, the products between the weights these terms have in the description of v i and v are computed and accumu- lated into the values these rows have in the SMM. This technique is applied until all the terms describing the verti- ces assigned to the columns have been processed. Figure 6 shows how the similarity between the vertex v 1 assigned to the first column of “Block 0” and v is computed. In this example, it has been assumed that there are 15 terms describing the documents of the collection, the size of the block is k = 5, and T v = f1;2;5;8;10;12;14g, W v = f0:2;0:6;0:3;0:7;0:2;0:1;0:5g, T v1 = f2;3;5;8;10;11;12;14g and W v1 = f0:5;0:3;0:4;0:8;0:2;0:6;0:3;0:3g. As it can be seen from Figure 6(a), each thread of thet rows of the first column computes the product between the weight of the term it points at, and the weight this same term has inP v (i.e., the description ofv). As it was menti- oned before, the computed products are stored in the SMM of that block. Note from Figure 6(a) that the product com- puted by the second row is zero since vertex v does not contain the term pointed out by this row; i.e., term having index 3rd inT . Figure 6(b) shows how when Tiling is ap- plied, the remaining terms describingv 1 are pointed by the rows of the first column. Figure 6(c) shows how the pro- ducts between the remaining terms ofv 1 andv are perfor- med. Finally, Figure 7 shows which are the values stored in the first column of the SMM of “Block 0”, once all the products have been computed. Once all the terms describing the vertices assigned to a block have been processed, a reduction is applied over each column of the block. Reduction [12] is an operation that computes in parallel the sum of all the values of a column of the SMM and then, it stores this sum in the first row of the column. Figure 8 shows the final sum obtained for the first column of “Block 0”. The sum obtained on the column to with vertex v i has been assigned corresponds with the numerator of the co- sine measure betweenv andv i . This sum is then divided by the product of the norms ofv andv i , which have been previously computed; the result of this division (i.e., the si- milarity betweenv andv i ) is copied to the CPU. Using this 0.3 0 0.12 0.56 0.4 Block 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 P v v 1 v 2 v t Suc v T v 2 W v 2 … 0 0.2 0.6 0 0 0.3 0 0 0.7 0 0.2 0 0.1 0 0.5 t=5 2 3 5 8 10 11 12 14 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 (a) Computing the firstt products 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 P v 0 0.2 0.6 0 0 0.3 0 0 0.7 0 0.2 0 0.1 0 0.5 0.3 0 0.12 0.56 0.4 Block 0 v 1 v 2 v t Suc v T v 2 W v 2 … t=5 2 3 5 8 10 11 12 14 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 (b) Applying Tiling 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 P v 0 0.2 0.6 0 0 0.3 0 0 0.7 0 0.2 0 0.1 0 0.5 0.3 0 0.12 0.56 0.4 v 1 v 2 v t Suc v T v 2 W v 2 … t=5 2 3 5 8 10 11 12 14 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 Block 0 (c) Computing remaining products Figure 6: Illustration of how CUDA-OClus computes the similarity between a vertexv and the vertices inSuc v . result CUDA-OClus decides if it should create or not an edge betweenv andv i , during this step CUDA-OClus also updates the value ofAIS(v) andAIS(v i ). 236 Informatica 42 (2018) 229–244 L.J. González-Soler et al. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 P v 0 0.2 0.6 0 0 0.3 0 0 0.7 0 0.2 0 0.1 0 0.5 0.3 0.03 0.27 0.56 0.4 v 1 v 2 v t Suc v T v 2 W v 2 … t=5 2 3 5 8 10 11 12 14 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 Block 0 Figure 7: Final results stored in the SMM after processing all terms ofv 1 . 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 P v 0 0.2 0.6 0 0 0.3 0 0 0.7 0 0.2 0 0.1 0 0.5 1.56 v 1 v 2 v t Suc v T v 2 W v 2 … t=5 2 3 5 8 10 11 12 14 0.5 0.3 0.4 0.8 0.2 0.6 0.3 0.3 T v 1 W v 1 Block 0 Figure 8: Result of applying Reduction on the first column of “Block 0”. The pseudocode of cosine similarity function is shown in Algorithm 1. Once the thresholded similarity graph e G has been built, CUDA-OClus speeds up the computation of the other time- consuming step: the computation of the relevance of the vertices. In order to do that, CUDA-OClus computes in pa- rallel the relevance of all the vertices of e G . Moreover, for each vertexv, CUDA-OClus computes in parallel the con- tribution each adjacent vertex ofv has over the relevance of v, speeding up even more the computation of the relevance ofv. In order to accomplish this idea, the list of vertices of e G is partitioned intok sublists and each sublist is assigned to a block inside a grid. However, in this case, when a ver- texv i of a sublist is assigned to a column of a block, each row in that column will point to an adjacent vertex of v i ; e.g., thejth row points at thejth adjacent vertex ofv i . Dif- Algorithm 1: CUDA implementation of the cosine similarity function. Input:Sucv the list of vertices that follow a vertexv,Pv weights of vertexv,Wv weights associated tov,Tv position of terms that represent tov,Normv is the norm ofv Output:similarity: cosine similarity values betweenv andSucv 1 __shared__floatSMM[R][C] ; // R = C because block are squared 2 inttid=threadIdx:x+blockDim:x blockIdx:x; 3 if (tid0&&Normu >0) then / * Dividing between the multiplication of norms of v and u * / 21 similarity[tid]= SMM[0][threadIdx:x]=(Normv Normu) ferent from building graph e G , now the threads associated with a column will compute the relevance of its assigned vertex. With this aim, the thread on each row of that co- lumn will compute the contributions the vertex pointed by that row has over the relevance of the vertex assigned to the column. Letv be a vertex assigned to a column andu one of its adjacent vertices. Vertexu contributes 1 jv:Adjj to the rele- vance of v ifjv:Adjjj u:Adjj; otherwise, its contribu- tion is zero. This case represents the contributionu has to the relevance of v through the relative density of v. On the other hand, u contributes 1 jv:Adjj to the relevance ofv ifAIS(v) AIS(u); otherwise, its contribution is zero. This other case represents the contributionu has to the re- levance of v through the relative compactness of v. The total contribution provided by a vertex is added to the va- lue the row already has in the SMM; similar to the case of building graph e G , the SMM of each block is initially emp- tied. Ifv has more thant adjacent vertex, a Tiling is app- lied. Once all the adjacent vertices ofv has been processed, a Reduction is applied in order to compute the relevance of v. Obtained values are then copied to the CPU. The pseudocode of cosine similarity function is shown in Algorithm 2. As it was mentioned before, the remaining steps of OClustR were not implemented in CUDA because they are Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 237 Algorithm 2: CUDA implementation of the rele- vance function. Input: e G weights threshold similarity graph,AIS( e G ) is the approximated intra-cluster similarity of e G Output:relevance: relevance values of vertices 1 __shared__floatSMM[R][C] ; // R = C because block are squared 2 inttid=threadIdx:x+blockDim:x blockIdx:x; 3 if (tid0j) then / * Dividing between the number of adjacents of the current vertex * / 23 relevance[tid]= SMM[threadIdx:y][threadIdx:x]=(2j Adj[tid]j); more favored with a CPU implementation since they are high memory consumption tasks. 3.2 CUDA-DClus algorithm In order to update an already clustered collection when changes take effect, in our case additions, DClustR first de- tects, in the graph e G representing the collection, which are the connected components that were affected by chan- ges and then, it updates the cover of those components and consequently, the overall clustering of the collection. As it was stated in Section 2.2, the connected compo- nents affected by additions are those containing at least one added vertex. Thus, each time vertices are added to e G , in addition to computing the similarity between these vertices and those already belonging to e G in order to create the respective edges, DClustR also needs to build from scra- tch each affected connected component in order to update their covers. In order to reduce the amount of informa- tion DClustR needs to store in memory, CUDA-DClus pro- poses to represent the graph e G using an array of partial connected components, namedArr PCC , and two parallel arrays. The first of these parallel arrays, named V , con- tains the vertices in the order in which they were added to e G . The second array, named PC V , contains the index of the partial connected component to which each vertex belongs. This new representation allows CUDA-DClus to not need to rebuild the affected components each time the collection changes, but keeping the affected components updated each time vertices are added to the graph e G , with no extra cost. Let e G =hV; e E ;Si be the thresholded similarity graph representing the collection of documents. A partial con- nected component (PCC) in e G is a connected subgraph induced by a subset of vertices of e G . A partial connected component is represented using two arrays: one array con- taining the indexes the vertices belonging to that compo- nent have in e G , and the other array containing the adja- cency list of the aforementioned vertices. The array of partial connected components representing e G is built once while e G is being constructed. The stra- tegy used by CUDA-DClus for this purpose is as follows. In the first step, CUDA-DClus adds a vertex inV for each document of the collection and then,PC V is emptied (i.e., it is filled with -1), meaning that the vertices do not be- long to any PCC yet. In the second step, CUDA-DClus processes each vertex v i 2 V . If v i does not belong yet to a PCC, CUDA-DClus creates a new PCC and it putsv i in this component; when a vertexv is added to a PCC, the index this PCC has in the array Arr PCC is stored in the array PC V , at the entry referred to by the index v has in arrayV ; this is the way CUDA-DClus uses to indicate that now v belongs to a PCC. Following, CUDA-DClus com- putes the similarity betweenv i and the vertices inSuc vi , using the strategy proposed by CUDA-OClus. Once these similarities have been computed, CUDA-DClus visits each vertexu2 Suc vi . IfS(v i ;u) andu does not belong to any PCC yet, thenu is inserted in the PCC to whichv i belongs and the adjacency lists of both vertices u and v i are modified in order to indicate they are similar to each other; otherwise, if u already belongs to a PCC only the adjacency lists of both vertices are modified. In this last case, if the partial connected components to which bothv i andu belong are not the same, we will say that these partial connected components are linked. As an example, let e G be initially empty and D = fd 1 ;d 2 ;:::;d 9 g be the set of documents that will be added to the collection. For the sake of simplicity, we will assume that CUDA-DClus already added the documents in e G as vertices and that the similarities existing between each pair of documents are those showed in Table 1. Taking into ac- count the above mentioned information, Figures 9, 10, 11 and 12 exemplify how CUDA-DClus builds the array of partial connected components representing graph e G , for = 0:3. As it can be seen from Figure 9, firstly CUDA-DClus processes vertexv 1 in order to build the first PCC. As the result of the above process verticesv 3 andv 5 are added to the first PCC, which is now constituted by vertices v 1 ;v 3 and v 5 . The second PCC is built when vertex v 2 is pro- cessed, see Figure 10; this component is finally constitu- ted by verticesv 2 ;v 4 andv 7 . Afterwards, as it can be seen 238 Informatica 42 (2018) 229–244 L.J. González-Soler et al. Vert./Vert. v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 v 9 v 1 - 0 0.4 0 0.5 0 0 0 0 v 2 0 - 0 0.4 0 0 0.5 0 0 v 3 0.4 0 - 0.7 0.6 0.3 0 0 0 v 4 0 0 0.7 - 0 0 0 0 0 v 5 0.5 0 0.6 0 - 0 0 0 0 v 6 0 0 0.3 0 0 - 0 0 0 v 7 0 0 0 0 0 0 - 0 0 v 8 0 0 0 0 0 0 0 - 0.5 v 9 0 0 0 0 0 0 0 0.5 - Table 1: Similarities existing between each pair of vertices of the example. V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 v 9 PC V 0 -1 0 -1 0 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8 v 1 v 5 v 3 0 2 4 24 0 0 Figure 9: Processing vertexv 1 . V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 v 9 PC V 0 1 0 1 0 -1 1 -1 -1 0 1 2 3 4 5 6 7 8 v 2 v 4 v 7 1 3 6 36 1 1 v 1 v 5 v 3 0 2 4 24 0 0 Figure 10: Processing vertexv 2 . from Figure 11, when vertexv 3 is being processed, CUDA- DClus updates the first PCC by adding vertexv 6 and upda- ting the adjacency list of verticesv 3 andv 5 ; CUDA-DClus also updates the second PCC by modifying the adjacency list of vertexv 4 , which is similar to vertexv 3 . In this exam- ple, these two partial connected components were joined by a dash line in order to illustrate the fact that they are linked since vertices v 3 , belonging to the first PCC, and v 4 , be- longing to the second PCC, are similar. Finally, the third PCC is created when CUDA-DClus processes vertex v 8 , V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 v 9 PC V 0 1 0 1 0 0 1 -1 -1 0 1 2 3 4 5 6 7 8 v 1 v 5 v 3 v 2 v 4 v 7 v 6 0 2 4 5 24 034 5 02 1 3 6 36 1 12 2 Figure 11: Processing vertexv 3 . V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 v 9 PC V 0 1 0 1 0 0 1 2 2 0 1 2 3 4 5 6 7 8 v 1 v 5 v 3 v 2 v 4 v 7 v 6 v 8 v 9 7 8 8 7 0 2 4 5 24 034 5 02 2 1 3 6 36 1 12 Figure 12: Processing vertexv 8 . as it can be seen in Figure 12. The processing of vertices v 4 ;v 5 ;v 6 ;v 7 and v 9 does not affect the partial connected components built so far, therefore, it was not included in the example. We would like to emphasize two facts about the above commented process. The first fact is that, since this is the first time the array Arr PCC representing e G is built, all these components are already in system memory. The se- cond fact is that if we put a PPC P i 2 Arr PCC into a setQ Pi and then, iteratively we add toQ Pi all the linked PCC of each PCC belonging to Q Pi , the resulting set is a connected component. Proof is straightforward by con- struction. Hereinafter, we will say that Q Pi is the con- nected component induced by PCCP i . Once the array Arr PCC representing e G was built, CUDA-DClus processes each of its partial connected com- ponents in order to build the clustering. For processing a PCC P i 2 Arr PCC that has not been processed in a previous iteration, CUDA-DClus first buildsQ Pi and then, CUDA-DClus recomputes the relevance of the vertices be- Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 239 longing to this component using the strategy proposed by CUDA-OClus. Once the relevance of the vertices have been recomputed, CUDA-DClus follows the same steps used by DClustR for updating the covering and conse- quently, the clustering ofQ Pi . Remaining steps of DClustR were not implemented in CUDA because they are more fa- vored with a CPU implementation. Once the clustering has been updated, CUDA-DClus stores the existing partial con- nected components in the hard drive, releasing in this way the system memory. Once e G changes due to the additions of documents to the collection, CUDA-DClus updates the array Arr PCC representing e G and then, it updates the current cluste- ring. In order to update the arrayArr PCC , CUDA-DClus adds for each incoming document, a vertex in e G and then, CUDA-DClus sets to -1 the entries that these verti- ces occupy in PC V , in order to express that they do not belong to any PCC yet. LetM =fv 1 ;v 2 ;:::;v k g be the set of added vertices. Afterwards, for processing a ver- texv i 2M, CUDA-DClus slightly modifies the strategy it uses for creating the partial connected components. Now, rather than computing the similarity ofv i only with the ver- tices that came after v i in V (i.e., Suc vi ), CUDA-DClus also computes the similarity ofv i with respect to the ver- tices that belong to e G before the changes; that is, the si- milarities are now computed betweenv i and each vertex in Suc vi [(VnM). Remaining steps are the same. Let D 1 =fd 10 ;d 11 ;:::;d 15 g be the set of documents that were added to the collection represented by graph e G , whose array of partial connected components was built in Figure 9, and letv 10 ;v 11 ;:::;v 15 be the vertices that were consequently added in e G by CUDA-DClus. For the sake of simplicity, in the example it is assumed that none of the vertices belonging to e G before the changes is similar to the added vertices, with the only exception of v 2 whose similarity with v 10 is 0:5. Table 2 shows the similarities between each pair of the added vertices. Figures 13, 14, 15 and 16 show, assuming = 0:3, how CUDA-DClus upda- tes the array of partial connected components representing e G after the above mentioned additions. In these figures, vertices filled with light gray are those that were added to the collection. v 10 v 11 v 12 v 13 v 14 v 15 v 10 - 0.4 0.3 0.6 0 0 v 11 0.4 - 0 0.4 0 0 v 12 0.3 0 - 0.4 0 0 v 13 0.6 0.4 0.4 - 0 0 v 14 0 0 0 0 - 0.5 v 15 0 0 0 0 0.5 - Table 2: Similarities existing between each pair of added vertices. As it can be seen in Figure 13, firstly, CUDA-DClus pro- cesses vertexv 10 and, as a result of this processing, another PCC is created for containing verticesv 10 ;v 11 ;v 12 andv 13 . This new PCC was joined with the PCC determined by ver- V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 V 9 V 10 V 11 V 12 V 13 V 14 V 15 PC V 0 1 0 1 0 0 1 2 2 3 3 3 3 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 v 1 v 5 v 3 v 2 v 4 v 7 v 6 v 8 v 9 7 8 8 7 0 2 4 5 24 034 5 02 2 1 3 6 36 1 12 v 10 v 12 v 11 v 13 9 10 11 12 9 101112 1 9 9 Figure 13: Processing vertexv 10 . V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 V 9 V 10 V 11 V 12 V 13 V 14 V 15 PC V 0 1 0 1 0 0 1 2 2 3 3 3 3 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 v 1 v 5 v 3 v 2 v 4 v 7 v 6 v 8 v 9 7 8 8 7 0 2 4 5 24 034 5 02 2 1 3 6 36 1 12 v 10 v 12 v 11 v 13 9 10 11 12 9 12 101112 1 9 9 10 Figure 14: Processing vertexv 11 . V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 V 9 V 10 V 11 V 12 V 13 V 14 V 15 PC V 0 1 0 1 0 0 1 2 2 3 3 3 3 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 v 1 v 5 v 3 v 2 v 4 v 7 v 6 v 8 v 9 7 8 8 7 0 2 4 5 24 034 5 02 2 1 3 6 36 1 12 v 10 v 12 v 11 v 13 9 10 11 12 9 12 101112 1 9 12 9 1011 Figure 15: Processing vertexv 12 . texv 2 , through a dash line, in order to reflect the fact that they are linked since verticesv 2 andv 10 are similar. Furt- hermore, as it can be seen in Figures 14 and 15, this fourth PCC is updated when verticesv 11 andv 12 are processed, in order to reflect the fact that they are similar to vertexv 13 . Finally, a fifth PCC is created when vertexv 14 is processed, 240 Informatica 42 (2018) 229–244 L.J. González-Soler et al. V v 1 v 2 v 3 v 4 v 5 v 6 v 7 v 8 V 9 V 10 V 11 V 12 V 13 V 14 V 15 PC V 0 1 0 1 0 0 1 2 2 3 3 3 3 -1 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 v 1 v 5 v 3 v 2 v 4 v 7 v 6 v 8 v 9 7 8 8 7 0 2 4 5 24 034 5 02 2 1 3 6 36 1 12 v 10 v 12 v 11 v 13 9 10 11 12 9 12 101112 1 9 12 9 1011 v 14 v 15 13 14 14 13 Figure 16: Processing vertexv 14 . see Figure 16; this PCC contains verticesv 14 andv 15 . Once the array Arr PCC has been updated, CUDA- DClus processes each new PCC following the same stra- tegy commented above, in order to update the current clus- tering. It is important to highlight that, different from when Arr PCC was created, this time the partial connected com- ponents loaded into the system memory are those belon- ging to the connected components determined by each new created PCC; the other partial connected components re- main in the hard drive. Although in the worst scenario an incoming document can be similar to all existing docu- ments in the collection, generally similarity graphs are very sparse so it is expected that the new representation pro- posed by CUDA-DClus as well as the strategy it uses for updating the array of partial connected components, help CUDA-DClus to save system memory. 4 Experimental results In this section, the results of several experiments done in order to show the performance of the CUDA-OClus and CUDA-DClus algorithms are presented. The experiments were conducted over eight document collections and were focused on: (1) assessing the correctness of the proposed parallel algorithms wrt. their original non parallel versi- ons, (2) evaluating the improvement achieved by the pro- posed algorithms with respect to the original OClustR and DClustR algorithms, and (3) evaluating the memory both CUDA-DClus and DClustR consume when they are pro- cessing the same collection. All the algorithms were imple- mented in C++; the codes of OClustR and DClustR algo- rithms were obtained from their authors. For implementing CUDA-OClus and CUDA-DClus the CUDA Toolkit 5.5 was used. All the experiments were performed on a PC with Core i7-4770 processor at 3.40 GHz, 8GB RAM, ha- ving a PCI express NVIDA GeForce GT 635, with 1 GB DRAM. The document collections used in our experiments were built from two benchmark text collections com- monly used in documents clustering: Reuters-v2 and TDT2. The Reuters-v2 can be obtained from http://kdd.ics.uci.edu, while TDT2 benchmark can be obtained from http://www.nist.gov/speech/tests/tdt.html. From these benchmarks, eight document collections were built. The characteristics of these collections are shown in Table 3. As it can be seen from Table 3, these collections are heterogeneous in terms of their size, dimension and the average size of the documents they contain. Coll. #Docs. #Terms Terms/Docs. Reu-10K 10000 33370 27 Reu-20K 20000 48493 41 Reu-30K 30000 59413 50 Reu-40K 40000 70348 58 Reu-50K 50000 74720 64 Reu-60K 60000 81632 69 Reu-70K 70000 91490 76 Tdt-65K 65945 114828 210 Table 3: Overview of the collections used in our experi- ments. In our experiments, documents were represented using the Vector Space Model (VSM) [13]. The index terms of the documents represent the lemmas of the words occurring at least once in the collection; these lemmas were extracted from the documents using Tree-tagger 1 . Stop words such as: articles, prepositions and adverbs were removed. The index terms of each document were statistically weighted using their term frequency. Finally, the cosine measure was used to compute the similarity between two documents [9]. 4.1 Correctness evaluation As it was mentioned before, the first experiment was fo- cused on assessing the correctness of the proposed algo- rithms. With this aim, we will compare the clusterings built by CUDA-OClus and CUDA-DClus with respect to those built by the original OClustR and DClustR algorithms, un- der the same conditions. For evaluating CUDA-OClus we selected the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections; whilst for evaluating CUDA-DClus we selected Reu-10K, Reu-20K and Reu-30K collections. These collections were selected due to they resemble the collections over which both OClustR and DClustR were evaluated in [6] and [7], respectively. In order to evaluate CUDA-OClus, we executed OClustR and CUDA-OClus over the Reu-10K, Reu-20K, Reu-30K, Reu-40K and Reu-50K collections, using = 0:25 and 0:35. We used these threshold values as these values obtai- ned the best results in several collections as reported in the original OClustR [6] and DClustR [7] articles. Then, we took the clustering results obtained by OClustR as ground truth and we evaluateds the clustering results obtained by CUDA-OClus in terms of their accuracy, using the FBcu- 1 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 241 bed [14] and the Normalized Mutual Information (NMI) [15] external evaluation measures. FBcubed is one of the external evaluation measures most used for evaluating overlapping clustering algorithms and unlike of other external evaluation metrics, it meets with four fundamental constrains proposed in [14] (cluster ho- mogeneity, cluster completeness, rag bag and cluster size vs quantity). On the other hand, NMI is a measure of si- milarity borrowed from information theory, which has pro- ved to be reliable [15]. Both metrics take values in [0;1], where1 means identical results and0 completely different results. In order to take into account the inherent data order dependency of CUDA-OClus, we executed CUDA-OClus twenty more times over the above mentioned collections, for each parameter value, varying the order of their docu- ments. Table 4 shows the average FBcubed and NMI va- lues attained by CUDA-OClus for each selected collection, using = 0:25 and0:35. FBCubed Threshold Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K =0.25 0.999 0.999 1.000 0.998 1.000 =0.35 0.999 1.000 1.000 1.000 0.999 NMI Threshold Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K =0.25 0.997 0.999 1.000 0.999 1.000 =0.35 0.998 1.000 1.000 1.000 0.999 Table 4: Average FBcubed and NMI values attained by CUDA-OClus for each selected collection. As it can be seen from Table 4, the average FBcubed and NMI values attained by CUDA-OClus are very close to 1, meaning that the clusters CUDA-OClus builds are al- most identical to those built by OClustR. The differences between the clusterings are caused by the inherent data or- der dependency of the algorithms and also because of the different floating point arithmetic used by CUDA. In order to asses the validity of CUDA-DClus, in the se- cond part of the first experiment, we will compare the clus- tering results built by CUDA-DClus with respect to those obtained by DClustR. With this aim, we obtain a ground truth by executing DClustR over the Reu-30K collection, also using = 0:25 and = 0:35, and then, we pro- cess Reu-20K and Reu-10K collections, in this order, as if they were additions of documents to the collection. That is, we are going to add the documents contained in Reu- 20K to the current collection (i.e., Reu-30K) and update the clustering using DClustR and after that, we are goind to add Reu-10K to the collection resulting from previous additions (i.e., Reu-30K union Reu-20K) and update the clustering again. We repeated the above mentioned exe- cution under the same parameter configuration but using CUDA-DClus instead of DClustR and afterwards. Then, we take the results obtained by DClustR as ground truth and we evaluate each of the three clustering results obtai- ned by CUDA-DClus in terms of their accuracy, using the FBcubed and NMI external evaluation measures. Like in the first part of this experiment, we executed CUDA-DClus twenty times under the above mentioned experimental con- figuration, each time varying the order of the documents inside the collections. Table 5 shows the average FBcu- bed and NMI values attained by CUDA-DClus for each se- lected collection, using = 0:25 and0:35. FBCubed Threshold Reu-30K Reu-30K+Reu-20K Reu-30K+Reu- 20K+Reu-10K =0:25 0.999 0.995 0.998 =0:35 0.995 0.996 0.991 NMI Threshold Reu-30K Reu-30K+Reu-20K Reu-30K+Reu- 20K+Reu-10K =0:25 0.998 0.994 0.999 =0:35 0.997 0.998 0.995 Table 5: Average FBcubed and NMI values attained by CUDA-DClus for each selected collection. As it can be seen from Table 5, the average FBcubed and NMI values attained by CUDA-DClus are very close to 1, meaning that the clusters it builds are almost identical to those built by DClustR. From this first experiment, we can conclude that the speed-up attained by CUDA-OClus and CUDA-DClus does not degrade their accuracy wrt. the original non parallel versions. 4.2 Execution time evaluation In the second experiment, we evaluate the time impro- vement achieved by CUDA-OClus and CUDA-DClus with respect to OClusR and DClustR, respectively. With this aim, we execute both OClustR and CUDA-OClus over Reu-10K, Reu-20K, Reu-30K, Reu-40K, Reu-50K, Reu- 60K and Reu-70K, using = 0:25 and 0:35 and we mea- sured the time they spent. Like in the previous experiment, in order to take into account the data order dependency of both algorithms, we repeated the above mentioned execu- tions twenty times, for each collection and each parame- ter configuration, but varying the order of the documents of the collections. Figure 17 shows the average time both OClustR and CUDA-OClus spent for clustering each se- lected collection, for each parameter configuration. As it can be seen from Figure 17, CUDA-OClus is fas- ter than OClustR over each selected dataset and for both values of ; for = 0:25 and = 0:35, CUDA-OClus is respectively 1.26x and 1.29x faster than OClustR. It is important to note from Figure 17 that as the size of the pro- cessed collection grows, the difference in the time spent for each algorithm also grows; this behavior shows how well CUDA-OClus scale when the size of the collection grows. We would like to highlight the fact that the speci- fications of the computer used in the experiments provided advantage to CPU-based algorithms over GPU-based algo- rithms, since a Core i7-4770 processor at 3.40 GHz with 8GB RAM is superior to a PCI express NVIDA GeForce GT 635, with 1 GB DRAM, which only has two streaming processors and a limited memory. Hence, taking into ac- count the execution model of a GPU, in which the grid blocks are numerated and they distributed among all stre- aming multiprocessors, which execute simultaneously one 242 Informatica 42 (2018) 229–244 L.J. González-Soler et al. 0 257 514 771 1028 Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K Reu-60K Reu-70K Time (sec.) Collec!ons OClustR CUDA-Clus (a) = 0:25 0 215 430 645 860 Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K Reu-60K Reu-70K Time (sec.) Collec!ons OClustR CUDA-Clus (b) = 0:35 Figure 17: Time spent by OClustR and CUDA-OClus for clustering the selected experimental datasets, using = 0:25 and0:35. task over a specific block, then we expect that if we use a powerful GPU with more streaming multiprocessor, the difference between the processing time achieved by paral- lel version and sequential version will be higher than the one showed in this experiments. In order to compare both DClustR and CUDA-DClus, in the second part of the second experiment, we clustered the Reu-50K collection using both algorithms and then, we measured the time each algorithm spent for updating the current clustering each timeN documents of Tdt-65K col- lection are incrementally added to the existing collection. In this experiment we also used = 0:25 and 0:35, and we set N = 5000 and N = 1000 which are much grea- ter values than those used to evaluate DClustR [7]. In or- der to take into account the data order dependency of both algorithms, the above mentioned executions were also re- peated twenty times, for each collection and each parame- ter configuration, but varying the order of the documents of the collections. Figure 18 shows the average time both DClustR and CUDA-DClus spent for updating the current clustering, for each parameter configuration. As it can be seen from Figure 18, CUDA-DClus has a better behavior than DClustR, for each parameter configu- 0 200 400 600 5 10 15 20 25 Time (sec.) Objects (1*10 3 ) DClustR CUDA-DClus (a) = 0:25,N = 5000 0 150 300 450 600 5 10 15 20 25 Time (sec.) Objects (1*10 3 ) DClustR CUDA-DClus (b) = 0:35,N = 5000 0 350 700 1050 1400 10 20 30 40 50 Time (sec.) Objects (1*10 3 ) DClustR CUDA-DClus (c) = 0:25,N = 10000 0 350 700 1050 1400 10 20 30 40 50 Time (sec.) Objects (1*10 3 ) DClustR CUDA-DClus (d) = 0:35,N = 10000 Figure 18: Time spent by DClustR and CUDA-DClus for updating the current clustering, using = 0:25 and 0:35, forN = 5000 and10000. ration, when multiple additions are processed over the se- lected dataset, showing an average speed up of 1.25x and 1.29x for = 0:25;N = 5000 and = 0:35;N = 5000 respectively. Moreover, it also showed an average speed up of 1.19x and 1.26x for = 0:25;N = 10000 and Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 243 = 0:35;N = 10000 respectively. As in the first part of this experiment, it can be seen also from Figure 18, that the behavior of CUDA-DClus, with respect to that of DClustR, becomes better as the size of the collection grows; in this way, we can say that CUDA-DClus also scales well as the size of the collection grows. 4.3 Memory use evaluation Although the spatial complexity of both algorithm is O(jVj+ e E ), the strategy CUDA-DClus proposes for re- presenting e G should allow to reduce the amount of me- mory needed to update the clustering each time the col- lection changes. Thus, in the third experiment, we compare the amount of memory used by CUDA-DClus against that used by DClustR, when processing the changes performed in the second experiment. The amount of connected com- ponent loaded by both algorithms when they are updating the current clustering after changes, is directly proportio- nal to the memory used. Based on this, Figure 19 shows the average number of connected components (i.e., Ave. NCC) each algorithm load into system memory, when processing the changes presented in Figure 18, for each parameter con- figuration. As it can be seen from Figure 19, CUDA-DClus consu- mes less memory than DClustR, for each parameter con- figuration, thereby hence resulting the memory usage of CUDA-DClus is respectively 22:43% for = 0:25 and 42:46% for = 0:35 less than the one of DClustR. The above mentioned characteristic, plus the fact that CUDA- DClus is also faster than DClustR, makes CUDA-DClus suitable for applications processing large document col- lections. We would like to highlight that in the worst scenario, if the clustering of all the connected components needs to be updated, all the partial connected components will be loaded to system memory and thus, our proposed CUDA- DClus and DClustR will have a similar behavior. Addi- tionally, taking into account the results of experiments in sections 4.1 and 4.2, we can conclude that the strategy pro- posed for reducing the memory used by CUDA-DClus does not include any considerable cost in the overall processing time of CUDA-DClus or in its accuracy. 5 Conclusions In this paper, we introduced two GPU-based parallel versi- ons of the OClustR and DClustR clustering algorithms, na- mely CUDA-OClus and CUDA-DClus, specifically tailo- red for document clustering. CUDA-OClus proposes a stra- tegy in order to speed up the most time consuming steps of OClustR. This strategy is reused by CUDA-DClus in or- der to speed up the most time consuming steps of DClustR. Moreover, CUDA-DClus proposes a new strategy for re- presenting the graph e G that DClustR uses for represen- ting the collection of documents. This new representation 60 100 140 180 5 10 15 20 25 Ave. NCC Objects (1*10 3 ) DClustR CUDA-DClus (a) = 0:25,N = 5000 500 900 1300 1700 2100 2500 5 10 15 20 25 Ave. NCC Objects (1*10 3 ) DClustR CUDA-DClus (b) = 0:35,N = 5000 100 140 180 220 10 20 30 40 50 Ave. NCC Objects (1*10 3 ) DClustR CUDA-DClus (c) = 0:25,N = 10000 500 1500 2500 3500 10 20 30 40 50 Ave. NCC Objects (1*10 3 ) DClustR CUDA-DClus (d) = 0:35,N = 10000 Figure 19: Average number of connected components DClustR and CUDA-DClus load into system memory when they are updating the current clustering, using = 0:25 and0:35, forN = 5000 and10000. allows CUDA-DClus to reduce the amount of memory it needs to use and also it helps CUDA-DClus to avoid re- building the affected components each time the collection changes but still keep them updated after each changes, 244 Informatica 42 (2018) 229–244 L.J. González-Soler et al. with no extra cost. The proposed parallel algorithms were compared against their original versions, over several standard document collections. The experiments were focused on: (a) as- sess the correctness of the proposed parallel algorithms, (b) evaluate the speed-up achieved by CUDA-OClus and CUDA-DClus with respect to OClustR and DClustR, re- spectively, and (c) evaluate the memory both CUDA-DClus and DClusR consumes when they are processing changes. From the experiments, it can be seen that both CUDA- OClus and CUDA-DClus are faster than OClustR and DClustR, respectively, and that the speed up these parallel versions attain do not degrade their accuracy. The experi- ments also showed that CUDA-DClus consumes less me- mory than DClustR, when both algorithms are processing changes over the same collection. Based on the obtained results, we can conclude that both CUDA-OClus and CUDA-DClus enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. These parallel al- gorithms could be useful in applications, like for instance news analysis, information organization and profiles iden- tification, among others. We would like to mention, that even when the proposed parallel algorithms were specifi- cally tailored for processing documents with the purpose of using the cosine measure, the strategy they propose can be easily extended to work with other similarity or distance measures like, for instance, euclidean and manhattan dis- tances. As future work, we are going to explore the use in CUDA-OClus and CUDA-DClus of other types of memo- ries in GPU such as texture memory, which is a faster me- mory than the one both CUDA-OClus and CUDA-DClus are using now. Besides, we are going to evaluate both al- gorithms over more faster GPU cards, in order to have a better insight of the performance of both algorithms when the number of CUDA cores are increased. References [1] E. Bae, J. Bailey, G. Dong, A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings, Data Mining and Knowledge Discovery 21 (3) (2010) 427–471. [2] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264–323. [3] S. Gregory, A fast algorithm to find overlapping com- munities in networks, in: Machine learning and kno- wledge discovery in databases, Springer, 2008, pp. 408–423. [4] J. Aslam, K. Pelekhov, D. Rus, Static and dynamic in- formation organization with star clusters, in: Procee- dings of the seventh international conference on In- formation and knowledge management, ACM, 1998, pp. 208–217. [5] A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz- Shulcloper, J. M. Pérez-Martínez, Jerartop: A new topic detection system, in: Progress in Pattern Re- cognition, Image Analysis and Applications, Sprin- ger, 2004, pp. 446–453. [6] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, Oclustr: A new graph-based algorithm for overlapping cluste- ring, Neurocomputing 121 (2013) 234–247. [7] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, An algo- rithm based on density and compactness for dynamic overlapping clustering, Pattern Recognition 46 (11) (2013) 3040–3055. [8] L. J. G. Soler, A. P. Suárez, L. Chang, Efficient over- lapping document clustering using gpus and multi- core systems, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - 19th Iberoamerican Congress, CIARP 2014, Puerto Vallarta, Mexico, November 2-5, 2014. Proceedings, 2014, pp. 264–271. [9] M. W. Berry, M. Castellanos, Survey of text mining, Computing Reviews 45 (9) (2004) 548. [10] R. Gil-García, A. Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recogni- tion Letters 31 (6) (2010) 469–477. [11] J. Sanders, E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming, Addison-Wesley Professional, 2010. [12] D. B. Kirk, W. H. Wen-mei, Programming massively parallel processors: a hands-on approach, Newnes, 2012. [13] G. Salton, A. Wong, C.-S. Yang, A vector space mo- del for automatic indexing, Communications of the ACM 18 (11) (1975) 613–620. [14] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A com- parison of extrinsic clustering evaluation metrics ba- sed on formal constraints, Information retrieval 12 (4) (2009) 461–486. [15] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New Journal of Physics 11 (3) (2009) 033015.