Informatica 42 (2018) 229–244 229
Static and Incremental Overlapping Clustering Algorithms for Large
Collections Processing in GPU
Lázaro Janier González-Soler, Airel Pérez-Suárez and Leonardo Chang
Advanced Technologies Application Center (CENATA V)
7ma A # 21406, Playa, CP: 12200, Havana, Cuba
E-mail: jsoler@ceneatav.co.cu and http://www.cenatav.co.cu/index.php/proﬁle/proﬁle/userproﬁle/jsoler
E-mail: asuarez@cenatav.co.cu and http://www.cenatav.co.cu/index.php/proﬁle/proﬁle/userproﬁle/asuarez
E-mail: lchang@cenatav.co.cu and http://www.cenatav.co.cu/index.php/proﬁle/proﬁle/userproﬁle/lchang
Keywords: data mining, clustering, overlapping clustering, GPU computing
Received: November 18, 2016
Pattern Recognition and Data Mining pose several problems in which, by their inherent nature, it is con-
sidered that an object can belong to more than one class; that is, clusters can overlap each other. OClustR
and DClustR are overlapping clustering algorithms that have shown, in the task of documents clustering,
the better tradeoff between quality of the clusters and efﬁciency, among the existing overlapping clustering
algorithms. Despite the good achievements attained by both aforementioned algorithms, they areO(n
2
)
so they could be less useful in applications dealing with a large number of documents. Moreover, although
DClustR can efﬁciently process changes in an already clustered collection, the amount of memory it uses
could make it not suitable for applications dealing with very large document collections. In this paper, two
GPU-based parallel algorithms, named CUDA-OClus and CUDA-DClus, are proposed in order to enhance
the efﬁciency of OClustR and DClustR, respectively, in problems dealing with a very large number of
documents. The experimental evaluation conducted over several standard document collections showed
the correctness of both CUDA-OClus and CUDA-DClus, and also their better performance in terms of
efﬁciency and memory consumption.
Povzetek: OClustR in DClustR sta prekrivna algoritma za gruˇ cenje, ki dosegata dobre rezultate, vendar
je njuna kompleksnost kvadratnega reda velikosti. V tem prispevku sta predstavljena dva paralelna algo-
ritma, ki temeljita na GPU: CUDA-OClus in CUDA-DClus. V eksperimentih sta pokazala zmožnost dela
z velikimi koliˇ cinami podatkov.
1 Introduction
Clustering is a technique of Machine Learning and Data
Mining that has been widely used in several contexts [1].
This technique aims to structure a data set in clusters or
classes such that objects belonging to the same class are
more similar than objects belonging to different classes [2].
There are several problems that, by their inherent nature,
consider that objects could belong to more than one class
[3, 4, 5]; that is, clusters can overlap each other. Most of the
clustering algorithms developed so far do not consider that
clusters could share elements; however, the desire of ade-
quately target those applications dealing with this problem,
have recently favored the development of overlapping clus-
tering algorithms; i.e., algorithms that allow objects to be-
long to more than one cluster. An overlapping clustering
algorithm that has shown, in the task of documents clus-
tering, the better tradeoff between quality of the clusters
and efﬁciency, among the existing overlapping clustering
algorithms, is OClustR [6]. Despite the good achievements
attained by OClustR in the task of documents clustering, it
has two main limitations:
1. It has a computational complexity of O(n
2
), so it
could be less useful in applications dealing with a
large amount of documents.
2. It assumes that the entire collection is available be-
fore clustering. Thus, when this collection changes it
needs to rebuild the clusters starting from scratch; that
is, OClustR does not use the previously built cluste-
ring for updating the clusters after changes.
In order to overcome the second limitation, the DClustR
algorithm was proposed by Pérez-Suárez et al. in [7].
DClustR introduced a strategy for efﬁciently updating the
clustering after multiple additions and/or deletions from
the collection, making it suitable for handling overlapping
clustering in applications where the collection changes fre-
quently, specially for those applications handling multi-
ple changes at the same time. Nevertheless, DClustR still
suffers from the ﬁrst limitation; that is, like OClustR, it
is O(n
2
). This implies that when the collection grows a
lot, the time that DClustR uses for processing the chan-
ges could make it less useful in real applications. Moreo-
ver, when the collection grows, the memory space used by
230 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
DClustR for storing the data it needs will also grow, ma-
king DClustR a high memory consumer and consequently,
making it not suitable for applications dealing with large
collections. Motivated by the above mentioned facts, in this
work we extend both OClustR and DClustR for efﬁciently
processing very large document collections.
A technique that has been widely used in recent years
in order to speed-up computing tasks is parallel computing
and speciﬁcally, GPU computing. A GPU is a device that
was initially designed for processing algorithms belonging
to the graphical world, but due to its low cost, its high level
of parallelism and its optimized ﬂoating-point operations,
it has been used in many real applications dealing with a
large amount of data.
The main contribution of this paper is the proposal
of two GPU-based parallel algorithms, namely CUDA-
OClus and CUDA-DClus, which enhance the efﬁciency of
OClustR and DClustR, respectively, in problems dealing
with a very large number of documents, like for instance
news analysis, information organization and proﬁles iden-
tiﬁcation, among others.
Preliminary results of this paper were published in [8].
The main differences of this paper with respect to the
conference paper presented in [8] are the following: (1)
we introduce a new GPU-based algorithm, named CUDA-
DClus, which is a parallel version of the DClustR algo-
rithm, that is able to efﬁciently process changes in an alre-
ady clustered collection and to efﬁciently process large col-
lections of documents, and (2) we introduce a strategy for
incrementally building and updating the connected compo-
nents presented in a graph, allowing CUDA-DClus to mi-
nimize the memory needed for processing the whole col-
lection. It is important to highlight that in CUDA-DClus we
only analyze the additions of objects to the collection, be-
cause this is the case in which it could be difﬁcult to apply
DClustR in real applications dealing with large collections,
since this is the case that makes the collection grow.
The remainder of this paper is organized as follows:
in Section 2, a brief description of both the OClustR
and DClustR algorithms are presented. In Section 3, the
CUDA-OClus and CUDA-DClus parallel clustering algo-
rithms are proposed. An experimental evaluation, showing
the performance of both proposed algorithms on several
document collections, is presented in Section 4. Finally,
the conclusions as well as some ideas about future directi-
ons are presented in Section 5.
2 OClustR and DClustR algorithms
In this section, both the OClustR [6] and DClustR [7] al-
gorithms are described. Since DClustR is the extension
of OClustR for efﬁciently processing collections that can
change due to additions, deletions and modiﬁcations, the
OClustR is ﬁrst introduced and then, the strategy used by
DClustR for updating the clustering after changes is pre-
sented. All the deﬁnitions and examples presented in this
section were taken from [6, 7].
2.1 The OClustR algorithm
In order to build a set of overlapping clusters from a col-
lection of objects, OClustR employs a strategy comprised
of three stages. In the ﬁrst stage, the collection of objects is
represented by OClustR as a weighted thresholded simila-
rity graph. Afterwards, in the second stage, an initial set of
clusters is built through a cover of the graph representing
the collection, using a special kind of subgraph. Finally, in
the third stage the ﬁnal set of overlapping clusters is obtai-
ned by improving the initial set of clusters. Following, each
stage is brieﬂy described.
Let O = fo
1
;o
2
;:::;o
n
g be a collection of objects,
  2 [0;1] a similarity threshold, and S:O  O!< a sym-
metric similarity function. A weighted thresholded simila-
rity graph, denoted as
e
G
  =hV;
e
E
  ;Si, is an undirected
and weighted graph such thatV = O and there is an edge
(v;u)2
e
E
  iffS(v;u)    ; each edge(v;u)2
e
E
  ;v6=u
is labeled with the value of S(v;u). As it can be infer-
red, in the ﬁrst stage OClustR must compute the similarity
between each pair of objects; thus, the computational com-
plexity of this stage isO(n
2
).
Once
e
G
  is built, in the second stage OClustR builds
an initial set of clusters through a covering of
e
G
  , using
weighted star-shaped sub-graphs.
Let
e
G
  =hV;
e
E
  ;Si be a weighted thresholded simila-
rity graph. A weighted star-shaped sub-graph (ws-graph)
in
e
G
  , denoted by G
?
=hV
?
;E
?
;Si, is a sub-graph of
e
G
  , having a vertexc2V
?
, called the center ofG
?
, such
that there is an edge betweenc and all the other vertices in
V
?
nfcg; these vertices are called satellites. All vertices in
e
G
  having no adjacent vertices (i.e., isolated vertices) are
considered degenerated ws-graphs.
For building a covering of
e
G
  using ws-graphs, OClustR
must build a setW =fG
?
1
;G
?
2
;:::;G
?
k
g of ws-graphs of
e
G
  , such thatV =
S
k
i=1
V
?
i
, beingV
?
i
;8i = 1:::k, the
set of vertices of the ws-graph G
?
i
. For solving this pro-
blem, OClustR searches for a list C = fc
1
;c
2
;:::;c
k
g,
such thatc
i
2 C is the center ofG
?
i
2 W ,8i = 1::k. In
the following, we will say that a vertex v is covered if it
belongs toC or if it is adjacent to a vertex that belongs to
C. For pruning the search space and for establishing a cri-
terion in order to select the vertices that should be included
inC, the concept of relevance of a vertex is introduced.
The relevance of a vertexv, denoted asv:relevance, is
deﬁned as the average between the relative density and the
relative compactness of a vertexv, denoted asv:densityR
and v:compactnessR, respectively, which are deﬁned as
follows:
v:densityR =
jfu2v:Adj=jv:Adjj j u:Adjjgj
jv:Adjj
;
v:compactnessR =
jfu2v:Adj=AIS(G
?
v
)  AIS(G
?
u
)gj
jv:Adjj
;
wherev:Adj andu:Adj are the set of adjacent vertices
ofv andu, respectively;G
?
v
andG
?
u
are the ws-graphs de-
termined by verticesv andu, andAIS(G
?
v
) andAIS(G
?
u
)
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 231
are the approximated intra-cluster similarity ofG
?
v
andG
?
u
,
respectively. The approximated intra-cluster similarity of a
ws-graphG
?
is deﬁned as the average weight of the edges
existing inG
?
between its center and its satellites.
Based on the above deﬁnitions, the strategy that OClustR
uses in order to build the listC is composed of three steps.
First, a candidate listL containing the vertices having rele-
vance greater than zero is created; isolated vertices are di-
rectly included inC. Then,L is sorted in decreasing order
of their relevance and each vertexv2 L is visited. Ifv is
not covered yet or it has at least one adjacent vertex that is
not covered yet, thenv is added toC. Each selected vertex,
together with its adjacent vertices, constitutes a cluster in
the initial set of clusters. The second stage of OClustR also
has a computational complexity ofO(n
2
). Figure 1 shows
through an example, the steps performed by OClustR in the
second stage for building the initial set of clusters.
Finally, in the third stage, the ﬁnal clusters are obtained
though a process which aims to improve the initial clusters.
With this aim, OClustR processesC in order to remove the
vertices forming a non-useful ws-graph. A vertexv forms
a non-useful ws-graph if: a) there is at least another ver-
texu2 C such that the ws-graphu determines includesv
as a satellite, andb) the ws-graph determined byv shares
more vertices with other existing ws-graphs than those it
only contains. For removing non useful vertices, OClustR
uses three steps. First, the vertices inC are sorted in des-
cending order according to their number of adjacent verti-
ces. After that, each vertex v 2 C is visited in order to
remove those non-useful ws-graphs determined by vertices
in(v:Adj\C). If a ws-graphG
?
u
, withu2 (v:Adj\C),
is non-useful,u is removed fromC and the satellites it only
covers are “virtually linked” tov by adding them to a list
namedv:Linked; in this way, those vertices virtually lin-
ked to v will also belong to the ws-graph v determines.
Once all vertices in (v:Adj\C) are analyzed, v together
with the vertices inv:Adj andv:Linked constitute a ﬁnal
cluster. This third stage also has a computational complex-
ity ofO(n
2
). Figure 2 shows through an example, how the
ﬁnal clusters are obtained from the initial clusters showed
in Figure 1(d).
2.2 Updating the clusters after changes: the
DClustR algorithm
Let
e
G
  = hV;
e
E
  ;Si be the weighted thresholded simi-
larity graph that represents an already clustered collection
O. LetC =fc
1
;c
2
;:::;c
k
g be the set of vertices repre-
senting the current covering of
e
G
  and consequently, the
current clustering. When some vertices are added to and/or
removed from O (i.e., from
e
G
  ), there could happen the
following two situations:
1) Some vertices become uncovered. This situation
occurs when at least one of the added vertices is unco-
vered or when those vertices ofC covering a speciﬁc
vertex were deleted from
e
G
  .
0.15
0.20
0.30
0.20
0.40
0.30
0.10
0.10
0.10
0.10
0.10
0.20
0.20
0.20
0.20
0.50
0.40
0.20
0.16
0.30
0.50
0.70
0.32
0.40
0.20
0.25
0.20
(a) A weighted thresholded similarity graph
e
G
  0
0
0.9
0.5
0.25
0.3
0.92
0.84
0
0
0
0.84
0
0.17
0.6
0
0.5
0.5
0.5
0
0
0.67
0.5
(b)
e
G
  using relevance for labeling the vertices
0
0
0.9
0.5
0.25
0.3
0.92
0.84
0
0
0.84
0
0.17
0.6
0
0.5
0.5
0.5
0
0
0.67
0
0.5
(c) Vertices belonging to setC
0
0
0.9
0.5
0.25
0.3
0.92
0.84
0
0
0.84
0
0.17
0.6
0
0.5
0.5
0.5
0
0
0.67
0
0.5
(d) Set of initial clusters
Figure 1: Illustration of how OClustR builds the initial set
of clusters.
232 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
0
0
0.9
0.5
0.25
0.3
0.92
0.84
0
0
0.84
0
0.17
0.6
0
0.5
0.5
0.5
0
0
0.67
0
0.5
(a) Vertices determining non-useful ws-graphs (ﬁlled
with light gray)
0
0
0.9
0.5
0.25
0.3
0.92
0.84
0
0
0.84
0
0.17
0.6
0
0.5
0.5
0.5
0
0
0.67
0
0.5
(b) Final set of overlapping clusters
Figure 2: Illustration of how the ﬁnal clusters are obtained
by OClustR in the third stage.
2) The relevance of some vertices changes and, as a con-
sequence, at least one vertexu = 2C appears such that
u has relevance greater than at least one vertex inC
that covers vertices in u:Adj[fug. Vertices like u
could determine ws-graphs with more satellites and
less overlapping with other ws-graphs than other ws-
graphs currently belonging to the covering of
e
G
  .
Figure 3, illustrates the above commented situations over
the graph
e
G
  of Figure 1(a). Figure. 3(a), shows the graph
e
G
  before the changes; the vertices to be removed are mar-
ked with an “x”. Figure 3(b), shows graph
e
G
  after the
changes; vertices ﬁlled with light gray represent the added
vertices. Figures 3(c) and 3(d), show the updated graph
e
G
  with vertices labeled with letters and with their up-
dated value of relevance, respectively; vertices ﬁlled with
black correspond with those vertices currently belonging to
C. As it can be seen from Figures 3(c) and 3(d), vertices
S;F;G;I;H and J became uncovered after the changes,
while vertex B, which does not belong to C, has a rele-
vance greater than vertexD, which already belongs toC.
Taking into account the above mentioned situations, in
order to update the clustering after changes DClustR ﬁrst
detects which are the connected components of
e
G
  that
were affected by changes and then it iteratively updates the
covering of these components and consequently, their clus-
0.15
0.20
0.30
0.20
0.40
0.30
0.10
0.10
0.10
0.10
0.10
0.20
0.20
0.20
0.20
0.50
0.40
0.20
0.16
0.30
0.50
0.70
0.32
0.40
0.20
0.25
0.20
x
x
x
x
x
(a)
e
G
  before the changes
0.15
0.20
0.30
0.20
0.30
0.40
0.16
0.30
0.50
0.32
0.40
0.20
0.25
0.20
0.21
(b)
e
G
  after the changes
A
C
B
D
E
F
G
H
J
K
M
L
N
O
P
Q
I
S
R
(c)
e
G
  with vertices labeled with letters
0
0
0.88
0.67
0.5
0
0
0.75
0.5
0.67
0
0.5
0.5
0.5
0
0
0
0
0.7
(d)
e
G
  with vertices labeled with their updated value
of relevance
Figure 3: Illustration of how some changes in the collection
affect the current covering of the graph
e
G
  of Figure 2(b).
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 233
tering.
The connected components that are affected by changes
are those that contain vertices that were added or verti-
ces that were adjacent to vertices that were deleted from
e
G
  . Since DClustR has control over these vertices it can
build these components through a depth ﬁrst search, star-
ting from any of these vertices. Let G
0
=hV
0
;E
0
;Si be
a connected component affected by changes, whose cove-
ring must be updated. Let C
0
  C be the set of vertices
of G
0
which determine ws-graphs (i.e., clusters) covering
G
0
. DClustR follows the same principles of OClustR; that
is, it ﬁrst builds or completes the covering ofG
0
in order to
build an initial set of clusters (stage 1) and then, it impro-
ves these clusters in order to build the ﬁnal set of clusters
ofG
0
(stage 2). In fact, DClustR uses the same steps that
OClustR for the above two mentioned stages, but unlike
OClustR, DClustR modiﬁes the way in which the candi-
date listL, used in stage 1, is built.
In order to build candidate listL, DClustR ﬁrst recom-
putes the relevance value of all vertices inG
0
and it empties
the listc:Linked, for all verticesc2C
0
; this last action is
supported by the fact that, after changes, there could be ws-
graphs that were considered as non useful, which could be
no longer so. LetV
+
  (V
0
nC
0
) be the set of vertices of
G
0
with relevance greater than zero, which do not belong
toC
0
. For building the candidate listL, bothC
0
andV
+
are
processed.
For processingV
+
, DClustR visits each vertexv2 V
+
and it veriﬁesa) ifv is uncovered, orb) if at least one ad-
jacent vertex ofv is uncovered, orc) if there is at least one
vertexu2 v:Adj, such that there is no other vertex inC
0
coveringu whose relevance is greater than or equal to the
relevance ofv. If any of these three conditions is fulﬁlled,
v is added to L. Additionally, if the last condition is ful-
ﬁlled, all those vertices likeu are marked as “activated” in
order to use them whenC
0
is being processed. The compu-
tational complexity of the processing ofV
+
isO(n
2
).
For processingC
0
, DClustR visits the adjacent vertices
of each vertexv2C
0
. Any vertexu2v:Adj having grea-
ter relevance thanv is added toL; in these cases,v is addi-
tionally marked as “weak”. Once all the adjacent vertex of
v were visited, ifv was marked as “weak” or at least one of
its adjacent vertices were previously marked as “active”,v
is removed fromC
0
since it could be substituted by a more
relevant vertex. However, ifv has a relevance greater than
zero, it is still considered as a candidate and consequently,
it is added toL. The computational complexity of the pro-
cessing ofC
0
isO(n
2
).
Figure 4, shows the updated set of overlapping clusters
obtained by DClustR when it processes the graph in Fi-
gure 3(d); vertices ﬁlled with black represent the vertices
determining ws-graphs that cover each connected compo-
nent of
e
G
  .
Like OClustR, the computational complexity of
DClustR isO(n
2
).
0
0
0.88
0.67
0.5
0
0
0.75
0.5
0.67
0
0.5
0.5
0.5
0
0
0
0
0.7
Figure 4: Updated set of overlapping clusters obtained by
DClustR.
3 Proposed parallel algorithms
As it was mentioned in Section 1, despite the good achie-
vements attained by OClustR and DClustR in the task of
documents clustering, these algorithms areO(n
2
) so they
could be less useful in applications dealing with a very
large number of documents. Motivated by this fact, in this
section two massively parallel implementations in CUDA
of OClustR and DClustR are proposed in order to enhance
the efﬁciency of OClustR and DClustR in the above menti-
oned problems. These parallel algorithms, namely CUDA-
OClus and CUDA-DClus, take advantage of the beneﬁts of
GPUs, like for instance, the high bandwidth communica-
tion between CPU and GPU, and the GPU memory hierar-
chy.
Although in their original articles both OClustR and
DClustR were proposed as general purpose clustering al-
gorithms, the parallel extensions proposed in this work are
speciﬁcally designed for processing documents. This ap-
plication context is the same in which both OClustR and
DClustR were evaluated and it is also a context in which
very large collections are commonly processed. In the
context of document processing, both CUDA-OClus and
CUDA-DClus use the cosine measure [9] for computing
the similarity between two documents; this measure is the
function that has been used the most for this purpose [10].
The cosine measure between two documents d
i
and d
j
is
deﬁned as:
cos(d
i
;d
j
) =
P
m
k=1
d
i
(k)  d
j
(k)
kd
i
k kd
j
k
; (1)
whered
i
(k) andd
j
(k) are the weights of thek term in the
description of the documentsd
i
andd
j
, respectively;kd
i
k
andkd
j
k are the norms of documentsd
i
andd
j
, respecti-
vely.
In experiments conducted over several document col-
lections, it was veriﬁed that the ﬁrst stage of OClustR, the
construction of the similarity graph, consumes the 99% of
the processing time of the algorithm. The remaining 1%
is mainly dominated by the computation of the relevance
of the vertices. Based on this fact, the above two mentio-
ned steps are the ones that will be implemented in CUDA
234 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
by CUDA-OClus; remaining steps are high memory con-
suming tasks that are more favored with a CPU implemen-
tation. Analogously, in these experiments it was also ve-
riﬁed that the most time consuming steps of DClustR are
the updating of the graph after changes and the recompu-
ting of the relevance, so these steps will be implemented
in CUDA by CUDA-DClus. In this case, it could be no-
ticed also that the detection of the connected components
affected by changes is a high memory consuming task per-
formed by DClustR, so it is also important to address this
problem in CUDA-DClus.
Finally, it is also important to mention that since we
are dealing with the problem of processing very large do-
cument collections, CUDA-DClus only tackles additions,
which are the changes that could increase the size of the
collection. Implementing deletions is irrelevant for overco-
ming problems related with large document collections.
Following, the CUDA-OClus algorithm is ﬁrst introdu-
ced and then, the CUDA-DClust algorithm is presented.
3.1 CUDA-OClus algorithm
LetD =fd
1
;d
2
:::;d
n
g be a collection of documents des-
cribed by a set of terms. LetT =ft
1
;t
2
;:::;t
m
g be the
list containing all the different terms that describe at least
one document inD. CUDA-OClus represents a document
d
i
2 D by two parallel vectors, denoted byT
di
andW
di
.
The ﬁrst one contains the position that the terms describing
d
i
have inT , and the second one contains the weights that
those terms have in the description ofd
i
.
For building
e
G
  =hV;
e
E
  ;Si, OClustR demandsS to
be a symmetric similarity measure, so the similarity bet-
ween any two documents (i.e., vertices in
e
G
  ) needs to be
computed only once. Based on this fact and considering
the inherent order the documents have inside a collection
D (i.e., vertices inV ), for building the edges relatives to a
vertexv2V it is only necessary to compute the similarity
betweenv and each vertex followingv inV . LetSuc
v
be
the list of vertices that follow a vertex v in V . To speed
up the construction of
e
G
  , for each vertexv2V , CUDA-
OClus will compute in parallel the similarity betweenv and
the vertices inSuc
v
.
Considering the deﬁnition of the cosine measure, it can
be seen from Expression (1) that its numerator is a sum of
independent products which could be computed all at once.
On the other hand, taking into account that the norm of a
document can be computed while the document is being
read, the denominator of Expression (1) can be also resol-
ved with no extra time. Based on these facts, CUDA-OClus
also parallelizes the computation of the similarity between
a pair of vertices, in order to speed up even more the con-
struction of
e
G
  .
In order to carry out the previous idea, CUDA-OClus
builds a grid comprised of k square blocks, each block
having a shared memory square matrix (SMM); where
k =
n
t
+1 andt is the dimension of both the blocks and
the matrices. A grid is a logic representation of a matrix
of threads in the GPU. The use of SMM and its low la-
tency will allow CUDA-OClus to not constantly access the
CPU memory, speeding up the calculus of the similarity
between two vertices. CUDA-OClus assigns tot the maxi-
mum value allowed by the architecture of the GPU for the
dimension of a SMM.
When CUDA-OClus is going to compute the similarity
between a vertexv and the vertices inSuc
v
, it ﬁrst builds
a vectorP
v
of sizem. This vector has zero in all its entries
excepting in those expressed by the positions stored inT
v
;
these last entries contain their respective weights stored in
W
v
. Once P
v
has been built, the list Suc
v
is partitioned
intok sublists. Each one of these sublists is assigned to a
block constituting the grid and the SMM associated with
that block is emptied; i.e., all its cells are set to zero. When
a sublist Q =fv
1
;v
2
;:::;v
p
g is assigned to a block in-
side a grid, each vertex inQ is assigned to a column of the
block. In this context, to assign a vertexv
i
to a column me-
ans that each row of the column points to a term describing
v
i
; in this way, the jth row points to the jth term descri-
bingv
i
. Figure 5 shows an example of how the listSuc
v
is
divided by CUDA-OClus intok sublists and how these su-
blists are assigned to the blocks constituting the grid. The
example on Figure 5 shows how the ﬁrst vertex of the su-
blist assigned to “block 0” is assigned to the ﬁrst column
of that block; the other assignments could be deduced from
this example.
…
Block 0 Block 1 Block k
Grid
2
3
5
8
10
11
12
14
v
1
v
2
v
t
V
t+1
V
t+2
V
2t
V
n
Suc
v
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
T
v 2
W
v 2
…
…
T
v n
W
v n
t=5
Figure 5: Illustration of how CUDA-OClus divides Suc
v
and assigns each resulting sublist to the blocks.
Each row inside a column of a block has a thread that
performs a set of operations. In our case, the threads asso-
ciated with theith column will compute the similarity bet-
weenv and its assigned vertexv
i
. With this aim, the thread
associated with each row inside theith column will com-
pute the product between the weight that the term pointed
by that row has in the description ofv
i
, and the weight this
same term has in the description of vertexv. It is important
to note that although the sum in the numerator of Expres-
sion (1) runs over all the terms inT , the products that will
be different from zero are only those between terms shared
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 235
by both documents; this is the reason we only use the terms
ofv
i
and multiply their weights by the weights that these
terms have inv; remaining terms inv are useless.
Given that thejth row of the column to which vertexv
i
has been assigned, points to thejth term ofT
vi
, the weight
this term has inv
i
is stored at thejth position ofW
vi
and
the weight this same term has in v is stored at P
v
, in the
entry referred by the value stored at thejth position ofT
vi
.
The result of the product between the above mentioned
weights is added to the value the jth row already has in
the SMM. If the description of a vertex v
i
assigned to a
column of a block exceeds the length of the column (i.e.,
t) a tiling is applied at this block. Tiling [11] is a technique
that consists on dividing a data set into a number of small
subsets, such that each subset ﬁts into the block; i.e., the
SMM. Thus, when the rows of a column point at the nextt
terms, the products between the weights these terms have
in the description of v
i
and v are computed and accumu-
lated into the values these rows have in the SMM. This
technique is applied until all the terms describing the verti-
ces assigned to the columns have been processed. Figure 6
shows how the similarity between the vertex v
1
assigned
to the ﬁrst column of “Block 0” and v is computed. In
this example, it has been assumed that there are 15 terms
describing the documents of the collection, the size of
the block is k = 5, and T
v
= f1;2;5;8;10;12;14g,
W
v
= f0:2;0:6;0:3;0:7;0:2;0:1;0:5g,
T
v1
= f2;3;5;8;10;11;12;14g and W
v1
=
f0:5;0:3;0:4;0:8;0:2;0:6;0:3;0:3g.
As it can be seen from Figure 6(a), each thread of thet
rows of the ﬁrst column computes the product between the
weight of the term it points at, and the weight this same
term has inP
v
(i.e., the description ofv). As it was menti-
oned before, the computed products are stored in the SMM
of that block. Note from Figure 6(a) that the product com-
puted by the second row is zero since vertex v does not
contain the term pointed out by this row; i.e., term having
index 3rd inT . Figure 6(b) shows how when Tiling is ap-
plied, the remaining terms describingv
1
are pointed by the
rows of the ﬁrst column. Figure 6(c) shows how the pro-
ducts between the remaining terms ofv
1
andv are perfor-
med. Finally, Figure 7 shows which are the values stored
in the ﬁrst column of the SMM of “Block 0”, once all the
products have been computed.
Once all the terms describing the vertices assigned to a
block have been processed, a reduction is applied over each
column of the block. Reduction [12] is an operation that
computes in parallel the sum of all the values of a column
of the SMM and then, it stores this sum in the ﬁrst row of
the column. Figure 8 shows the ﬁnal sum obtained for the
ﬁrst column of “Block 0”.
The sum obtained on the column to with vertex v
i
has
been assigned corresponds with the numerator of the co-
sine measure betweenv andv
i
. This sum is then divided
by the product of the norms ofv andv
i
, which have been
previously computed; the result of this division (i.e., the si-
milarity betweenv andv
i
) is copied to the CPU. Using this
0.3
0
0.12
0.56
0.4
Block 0
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
v
v
1
v
2
v
t
Suc
v
T
v 2
W
v 2
…
0
0.2
0.6
0
0
0.3
0
0
0.7
0
0.2
0
0.1
0
0.5
t=5
2
3
5
8
10
11
12
14
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
(a) Computing the ﬁrstt products
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
v
0
0.2
0.6
0
0
0.3
0
0
0.7
0
0.2
0
0.1
0
0.5
0.3
0
0.12
0.56
0.4
Block 0
v
1
v
2
v
t
Suc
v
T
v 2
W
v 2
…
t=5
2
3
5
8
10
11
12
14
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
(b) Applying Tiling
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
v
0
0.2
0.6
0
0
0.3
0
0
0.7
0
0.2
0
0.1
0
0.5
0.3
0
0.12
0.56
0.4
v
1
v
2
v
t
Suc
v
T
v 2
W
v 2
…
t=5
2
3
5
8
10
11
12
14
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
Block 0
(c) Computing remaining products
Figure 6: Illustration of how CUDA-OClus computes the
similarity between a vertexv and the vertices inSuc
v
.
result CUDA-OClus decides if it should create or not an
edge betweenv andv
i
, during this step CUDA-OClus also
updates the value ofAIS(v) andAIS(v
i
).
236 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
v
0
0.2
0.6
0
0
0.3
0
0
0.7
0
0.2
0
0.1
0
0.5
0.3
0.03
0.27
0.56
0.4
v
1
v
2
v
t
Suc
v
T
v 2
W
v 2
…
t=5
2
3
5
8
10
11
12
14
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
Block 0
Figure 7: Final results stored in the SMM after processing
all terms ofv
1
.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
P
v
0
0.2
0.6
0
0
0.3
0
0
0.7
0
0.2
0
0.1
0
0.5
1.56
v
1
v
2
v
t
Suc
v
T
v 2
W
v 2
…
t=5
2
3
5
8
10
11
12
14
0.5
0.3
0.4
0.8
0.2
0.6
0.3
0.3
T
v 1
W
v 1
Block 0
Figure 8: Result of applying Reduction on the ﬁrst column
of “Block 0”.
The pseudocode of cosine similarity function is shown
in Algorithm 1.
Once the thresholded similarity graph
e
G
  has been built,
CUDA-OClus speeds up the computation of the other time-
consuming step: the computation of the relevance of the
vertices. In order to do that, CUDA-OClus computes in pa-
rallel the relevance of all the vertices of
e
G
  . Moreover, for
each vertexv, CUDA-OClus computes in parallel the con-
tribution each adjacent vertex ofv has over the relevance of
v, speeding up even more the computation of the relevance
ofv. In order to accomplish this idea, the list of vertices of
e
G
  is partitioned intok sublists and each sublist is assigned
to a block inside a grid. However, in this case, when a ver-
texv
i
of a sublist is assigned to a column of a block, each
row in that column will point to an adjacent vertex of v
i
;
e.g., thejth row points at thejth adjacent vertex ofv
i
. Dif-
Algorithm 1: CUDA implementation of the cosine
similarity function.
Input:Sucv the list of vertices that follow a vertexv,Pv weights of
vertexv,Wv weights associated tov,Tv position of terms that
represent tov,Normv is the norm ofv
Output:similarity: cosine similarity values betweenv andSucv
1 __shared__floatSMM[R][C] ; // R = C because block
are squared
2 inttid=threadIdx:x+blockDim:x  blockIdx:x;
3 if (tid<jSucvj) then
4 u =Sucv[tid];
5 inttidy =threadIdx:y;
6 floatsum=0;
7 while (tidy <jWuj) do
/
*
Accumulating the multiplication between
weights of v and u
*
/
8 sum +=Wu[tidy]  Pv[Tu[tidy]];
9 tidy +=R ; // Applying tiling technique
10 SMM[threadIdx:y][threadIdx:x]=sum;
/
*
Waiting that whole threats compute
multiplications between weights of v and
u2Sucv *
/
11 __syncthreads();
/
*
Applying reduction technique to calculate
P
m
k=1
di(k)  dj(k)
*
/
12 inti=R=2;
13 while (i!= 0) do
14 if (threadIdx:y<i) then
15 SMM[threadIdx:y][threadIdx:x] +=
SMM[threadIdx:y+i][threadIdx:x];
16 __syncthreads();
17 i=i=2;
18 if (threadIdx:y ==0&&tid<jSucvj) then
19 similarity[tid]=0;
20 if (Normv >0&&Normu >0) then
/
*
Dividing between the multiplication of
norms of v and u
*
/
21 similarity[tid]=
SMM[0][threadIdx:x]=(Normv  Normu)
ferent from building graph
e
G
  , now the threads associated
with a column will compute the relevance of its assigned
vertex. With this aim, the thread on each row of that co-
lumn will compute the contributions the vertex pointed by
that row has over the relevance of the vertex assigned to the
column.
Letv be a vertex assigned to a column andu one of its
adjacent vertices. Vertexu contributes
1
jv:Adjj
to the rele-
vance of v ifjv:Adjj j u:Adjj; otherwise, its contribu-
tion is zero. This case represents the contributionu has to
the relevance of v through the relative density of v. On
the other hand, u contributes
1
jv:Adjj
to the relevance ofv
ifAIS(v)  AIS(u); otherwise, its contribution is zero.
This other case represents the contributionu has to the re-
levance of v through the relative compactness of v. The
total contribution provided by a vertex is added to the va-
lue the row already has in the SMM; similar to the case of
building graph
e
G
  , the SMM of each block is initially emp-
tied. Ifv has more thant adjacent vertex, a Tiling is app-
lied. Once all the adjacent vertices ofv has been processed,
a Reduction is applied in order to compute the relevance of
v. Obtained values are then copied to the CPU.
The pseudocode of cosine similarity function is shown
in Algorithm 2.
As it was mentioned before, the remaining steps of
OClustR were not implemented in CUDA because they are
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 237
Algorithm 2: CUDA implementation of the rele-
vance function.
Input:
e
G
  weights threshold similarity graph,AIS(
e
G
  ) is the
approximated intra-cluster similarity of
e
G
  Output:relevance: relevance values of vertices
1 __shared__floatSMM[R][C] ; // R = C because block
are squared
2 inttid=threadIdx:x+blockDim:x  blockIdx:x;
3 if (tid<jVj) then
4 inttidy =threadIdx:y;
5 floatsum=0;
6 while (tidy <jAdj[tid]j) do
/
*
Checking if the density and compactness
conditions are met
*
/
7 if (jAdj[tidy]j j Adj[tid]j) then
8 sum+=1;
9 if (jAIS[tidy]j j AIS[tid]j) then
10 sum+=1;
11 tidy +=R ; // Applying tiling technique
12 SMM[threadIdx:y][threadIdx:x]=sum;
/
*
Waiting that whole threats check density and
compactness conditions
*
/
13 __syncthreads();
/
*
Applying reduction technique to calculate
relevance
*
/
14 inti=R=2;
15 while (i!= 0) do
16 if (threadIdx:y<i) then
17 SMM[threadIdx:y][threadIdx:x] +=
SMM[threadIdx:y+i][threadIdx:x];
18 __syncthreads();
19 i=i=2;
20 if (threadIdx:y ==0&&tid<jVj) then
21 relevance[tid]=0;
22 if (jAdj[tid]>0j) then
/
*
Dividing between the number of
adjacents of the current vertex
*
/
23 relevance[tid]=
SMM[threadIdx:y][threadIdx:x]=(2 j Adj[tid]j);
more favored with a CPU implementation since they are
high memory consumption tasks.
3.2 CUDA-DClus algorithm
In order to update an already clustered collection when
changes take effect, in our case additions, DClustR ﬁrst de-
tects, in the graph
e
G
  representing the collection, which
are the connected components that were affected by chan-
ges and then, it updates the cover of those components and
consequently, the overall clustering of the collection.
As it was stated in Section 2.2, the connected compo-
nents affected by additions are those containing at least one
added vertex. Thus, each time vertices are added to
e
G
  , in
addition to computing the similarity between these vertices
and those already belonging to
e
G
  in order to create the
respective edges, DClustR also needs to build from scra-
tch each affected connected component in order to update
their covers. In order to reduce the amount of informa-
tion DClustR needs to store in memory, CUDA-DClus pro-
poses to represent the graph
e
G
  using an array of partial
connected components, namedArr
PCC
, and two parallel
arrays. The ﬁrst of these parallel arrays, named V , con-
tains the vertices in the order in which they were added to
e
G
  . The second array, named PC
V
, contains the index
of the partial connected component to which each vertex
belongs. This new representation allows CUDA-DClus to
not need to rebuild the affected components each time the
collection changes, but keeping the affected components
updated each time vertices are added to the graph
e
G
  , with
no extra cost.
Let
e
G
  =hV;
e
E
  ;Si be the thresholded similarity graph
representing the collection of documents. A partial con-
nected component (PCC) in
e
G
  is a connected subgraph
induced by a subset of vertices of
e
G
  . A partial connected
component is represented using two arrays: one array con-
taining the indexes the vertices belonging to that compo-
nent have in
e
G
  , and the other array containing the adja-
cency list of the aforementioned vertices.
The array of partial connected components representing
e
G
  is built once while
e
G
  is being constructed. The stra-
tegy used by CUDA-DClus for this purpose is as follows.
In the ﬁrst step, CUDA-DClus adds a vertex inV for each
document of the collection and then,PC
V
is emptied (i.e.,
it is ﬁlled with -1), meaning that the vertices do not be-
long to any PCC yet. In the second step, CUDA-DClus
processes each vertex v
i
2 V . If v
i
does not belong yet
to a PCC, CUDA-DClus creates a new PCC and it putsv
i
in this component; when a vertexv is added to a PCC, the
index this PCC has in the array Arr
PCC
is stored in the
array PC
V
, at the entry referred to by the index v has in
arrayV ; this is the way CUDA-DClus uses to indicate that
now v belongs to a PCC. Following, CUDA-DClus com-
putes the similarity betweenv
i
and the vertices inSuc
vi
,
using the strategy proposed by CUDA-OClus. Once these
similarities have been computed, CUDA-DClus visits each
vertexu2 Suc
vi
. IfS(v
i
;u)    andu does not belong
to any PCC yet, thenu is inserted in the PCC to whichv
i
belongs and the adjacency lists of both vertices u and v
i
are modiﬁed in order to indicate they are similar to each
other; otherwise, if u already belongs to a PCC only the
adjacency lists of both vertices are modiﬁed. In this last
case, if the partial connected components to which bothv
i
andu belong are not the same, we will say that these partial
connected components are linked.
As an example, let
e
G
  be initially empty and D =
fd
1
;d
2
;:::;d
9
g be the set of documents that will be added
to the collection. For the sake of simplicity, we will assume
that CUDA-DClus already added the documents in
e
G
  as
vertices and that the similarities existing between each pair
of documents are those showed in Table 1. Taking into ac-
count the above mentioned information, Figures 9, 10, 11
and 12 exemplify how CUDA-DClus builds the array of
partial connected components representing graph
e
G
  , for
  = 0:3.
As it can be seen from Figure 9, ﬁrstly CUDA-DClus
processes vertexv
1
in order to build the ﬁrst PCC. As the
result of the above process verticesv
3
andv
5
are added to
the ﬁrst PCC, which is now constituted by vertices v
1
;v
3
and v
5
. The second PCC is built when vertex v
2
is pro-
cessed, see Figure 10; this component is ﬁnally constitu-
ted by verticesv
2
;v
4
andv
7
. Afterwards, as it can be seen
238 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
Vert./Vert. v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
v
9
v
1
- 0 0.4 0 0.5 0 0 0 0
v
2
0 - 0 0.4 0 0 0.5 0 0
v
3
0.4 0 - 0.7 0.6 0.3 0 0 0
v
4
0 0 0.7 - 0 0 0 0 0
v
5
0.5 0 0.6 0 - 0 0 0 0
v
6
0 0 0.3 0 0 - 0 0 0
v
7
0 0 0 0 0 0 - 0 0
v
8
0 0 0 0 0 0 0 - 0.5
v
9
0 0 0 0 0 0 0 0.5 -
Table 1: Similarities existing between each pair of vertices
of the example.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
v
9
PC
V
0
-1
0
-1
0
-1
-1
-1
-1
0
1
2
3
4
5
6
7
8
v
1
v
5
v
3
0
2
4
24
0
0
Figure 9: Processing vertexv
1
.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
v
9
PC
V
0
1
0
1
0
-1
1
-1
-1
0
1
2
3
4
5
6
7
8
v
2
v
4
v
7
1
3
6
36
1
1
v
1
v
5
v
3
0
2
4
24
0
0
Figure 10: Processing vertexv
2
.
from Figure 11, when vertexv
3
is being processed, CUDA-
DClus updates the ﬁrst PCC by adding vertexv
6
and upda-
ting the adjacency list of verticesv
3
andv
5
; CUDA-DClus
also updates the second PCC by modifying the adjacency
list of vertexv
4
, which is similar to vertexv
3
. In this exam-
ple, these two partial connected components were joined by
a dash line in order to illustrate the fact that they are linked
since vertices v
3
, belonging to the ﬁrst PCC, and v
4
, be-
longing to the second PCC, are similar. Finally, the third
PCC is created when CUDA-DClus processes vertex v
8
,
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
v
9
PC
V
0
1
0
1
0
0
1
-1
-1
0
1
2
3
4
5
6
7
8
v
1
v
5
v
3
v
2
v
4
v
7
v
6
0
2
4
5
24
034 5
02
1
3
6
36
1
12
2
Figure 11: Processing vertexv
3
.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
v
9
PC
V
0
1
0
1
0
0
1
2
2
0
1
2
3
4
5
6
7
8
v
1
v
5
v
3
v
2
v
4
v
7
v
6
v
8
v
9
7
8
8
7
0
2
4
5
24
034 5
02
2
1
3
6
36
1
12
Figure 12: Processing vertexv
8
.
as it can be seen in Figure 12. The processing of vertices
v
4
;v
5
;v
6
;v
7
and v
9
does not affect the partial connected
components built so far, therefore, it was not included in
the example.
We would like to emphasize two facts about the above
commented process. The ﬁrst fact is that, since this is the
ﬁrst time the array Arr
PCC
representing
e
G
  is built, all
these components are already in system memory. The se-
cond fact is that if we put a PPC P
i
2 Arr
PCC
into a
setQ
Pi
and then, iteratively we add toQ
Pi
all the linked
PCC of each PCC belonging to Q
Pi
, the resulting set is
a connected component. Proof is straightforward by con-
struction. Hereinafter, we will say that Q
Pi
is the con-
nected component induced by PCCP
i
.
Once the array Arr
PCC
representing
e
G
  was built,
CUDA-DClus processes each of its partial connected com-
ponents in order to build the clustering. For processing
a PCC P
i
2 Arr
PCC
that has not been processed in a
previous iteration, CUDA-DClus ﬁrst buildsQ
Pi
and then,
CUDA-DClus recomputes the relevance of the vertices be-
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 239
longing to this component using the strategy proposed by
CUDA-OClus. Once the relevance of the vertices have
been recomputed, CUDA-DClus follows the same steps
used by DClustR for updating the covering and conse-
quently, the clustering ofQ
Pi
. Remaining steps of DClustR
were not implemented in CUDA because they are more fa-
vored with a CPU implementation. Once the clustering has
been updated, CUDA-DClus stores the existing partial con-
nected components in the hard drive, releasing in this way
the system memory.
Once
e
G
  changes due to the additions of documents to
the collection, CUDA-DClus updates the array Arr
PCC
representing
e
G
  and then, it updates the current cluste-
ring. In order to update the arrayArr
PCC
, CUDA-DClus
adds for each incoming document, a vertex in
e
G
  and
then, CUDA-DClus sets to -1 the entries that these verti-
ces occupy in PC
V
, in order to express that they do not
belong to any PCC yet. LetM =fv
1
;v
2
;:::;v
k
g be the
set of added vertices. Afterwards, for processing a ver-
texv
i
2M, CUDA-DClus slightly modiﬁes the strategy it
uses for creating the partial connected components. Now,
rather than computing the similarity ofv
i
only with the ver-
tices that came after v
i
in V (i.e., Suc
vi
), CUDA-DClus
also computes the similarity ofv
i
with respect to the ver-
tices that belong to
e
G
  before the changes; that is, the si-
milarities are now computed betweenv
i
and each vertex in
Suc
vi
[(VnM). Remaining steps are the same.
Let D
1
=fd
10
;d
11
;:::;d
15
g be the set of documents
that were added to the collection represented by graph
e
G
  ,
whose array of partial connected components was built in
Figure 9, and letv
10
;v
11
;:::;v
15
be the vertices that were
consequently added in
e
G
  by CUDA-DClus. For the sake
of simplicity, in the example it is assumed that none of the
vertices belonging to
e
G
  before the changes is similar to
the added vertices, with the only exception of v
2
whose
similarity with v
10
is 0:5. Table 2 shows the similarities
between each pair of the added vertices. Figures 13, 14, 15
and 16 show, assuming  = 0:3, how CUDA-DClus upda-
tes the array of partial connected components representing
e
G
  after the above mentioned additions. In these ﬁgures,
vertices ﬁlled with light gray are those that were added to
the collection.
v
10
v
11
v
12
v
13
v
14
v
15
v
10
- 0.4 0.3 0.6 0 0
v
11
0.4 - 0 0.4 0 0
v
12
0.3 0 - 0.4 0 0
v
13
0.6 0.4 0.4 - 0 0
v
14
0 0 0 0 - 0.5
v
15
0 0 0 0 0.5 -
Table 2: Similarities existing between each pair of added
vertices.
As it can be seen in Figure 13, ﬁrstly, CUDA-DClus pro-
cesses vertexv
10
and, as a result of this processing, another
PCC is created for containing verticesv
10
;v
11
;v
12
andv
13
.
This new PCC was joined with the PCC determined by ver-
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
V
9
V
10
V
11
V
12
V
13
V
14
V
15
PC
V
0
1
0
1
0
0
1
2
2
3
3
3
3
-1
-1
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
v
1
v
5
v
3
v
2
v
4
v
7
v
6
v
8
v
9
7
8
8
7
0
2
4
5
24
034 5
02
2
1
3
6
36
1
12
v
10
v
12
v
11
v
13
9
10
11
12
9
101112 1
9
9
Figure 13: Processing vertexv
10
.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
V
9
V
10
V
11
V
12
V
13
V
14
V
15
PC
V
0
1
0
1
0
0
1
2
2
3
3
3
3
-1
-1
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
v
1
v
5
v
3
v
2
v
4
v
7
v
6
v
8
v
9
7
8
8
7
0
2
4
5
24
034 5
02
2
1
3
6
36
1
12
v
10
v
12
v
11
v
13
9
10
11
12
9 12
101112 1
9
9 10
Figure 14: Processing vertexv
11
.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
V
9
V
10
V
11
V
12
V
13
V
14
V
15
PC
V
0
1
0
1
0
0
1
2
2
3
3
3
3
-1
-1
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
v
1
v
5
v
3
v
2
v
4
v
7
v
6
v
8
v
9
7
8
8
7
0
2
4
5
24
034 5
02
2
1
3
6
36
1
12
v
10
v
12
v
11
v
13
9
10
11
12
9 12
101112 1
9 12
9 1011
Figure 15: Processing vertexv
12
.
texv
2
, through a dash line, in order to reﬂect the fact that
they are linked since verticesv
2
andv
10
are similar. Furt-
hermore, as it can be seen in Figures 14 and 15, this fourth
PCC is updated when verticesv
11
andv
12
are processed, in
order to reﬂect the fact that they are similar to vertexv
13
.
Finally, a ﬁfth PCC is created when vertexv
14
is processed,
240 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
V
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
V
9
V
10
V
11
V
12
V
13
V
14
V
15
PC
V
0
1
0
1
0
0
1
2
2
3
3
3
3
-1
-1
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
v
1
v
5
v
3
v
2
v
4
v
7
v
6
v
8
v
9
7
8
8
7
0
2
4
5
24
034 5
02
2
1
3
6
36
1
12
v
10
v
12
v
11
v
13
9
10
11
12
9 12
101112 1
9 12
9 1011
v
14
v
15
13
14
14
13
Figure 16: Processing vertexv
14
.
see Figure 16; this PCC contains verticesv
14
andv
15
.
Once the array Arr
PCC
has been updated, CUDA-
DClus processes each new PCC following the same stra-
tegy commented above, in order to update the current clus-
tering. It is important to highlight that, different from when
Arr
PCC
was created, this time the partial connected com-
ponents loaded into the system memory are those belon-
ging to the connected components determined by each new
created PCC; the other partial connected components re-
main in the hard drive. Although in the worst scenario
an incoming document can be similar to all existing docu-
ments in the collection, generally similarity graphs are very
sparse so it is expected that the new representation pro-
posed by CUDA-DClus as well as the strategy it uses for
updating the array of partial connected components, help
CUDA-DClus to save system memory.
4 Experimental results
In this section, the results of several experiments done in
order to show the performance of the CUDA-OClus and
CUDA-DClus algorithms are presented. The experiments
were conducted over eight document collections and were
focused on: (1) assessing the correctness of the proposed
parallel algorithms wrt. their original non parallel versi-
ons, (2) evaluating the improvement achieved by the pro-
posed algorithms with respect to the original OClustR and
DClustR algorithms, and (3) evaluating the memory both
CUDA-DClus and DClustR consume when they are pro-
cessing the same collection. All the algorithms were imple-
mented in C++; the codes of OClustR and DClustR algo-
rithms were obtained from their authors. For implementing
CUDA-OClus and CUDA-DClus the CUDA Toolkit 5.5
was used. All the experiments were performed on a PC
with Core i7-4770 processor at 3.40 GHz, 8GB RAM, ha-
ving a PCI express NVIDA GeForce GT 635, with 1 GB
DRAM.
The document collections used in our experiments
were built from two benchmark text collections com-
monly used in documents clustering: Reuters-v2
and TDT2. The Reuters-v2 can be obtained from
http://kdd.ics.uci.edu, while TDT2 benchmark can be
obtained from http://www.nist.gov/speech/tests/tdt.html.
From these benchmarks, eight document collections were
built. The characteristics of these collections are shown in
Table 3. As it can be seen from Table 3, these collections
are heterogeneous in terms of their size, dimension and the
average size of the documents they contain.
Coll. #Docs. #Terms Terms/Docs.
Reu-10K 10000 33370 27
Reu-20K 20000 48493 41
Reu-30K 30000 59413 50
Reu-40K 40000 70348 58
Reu-50K 50000 74720 64
Reu-60K 60000 81632 69
Reu-70K 70000 91490 76
Tdt-65K 65945 114828 210
Table 3: Overview of the collections used in our experi-
ments.
In our experiments, documents were represented using
the Vector Space Model (VSM) [13]. The index terms of
the documents represent the lemmas of the words occurring
at least once in the collection; these lemmas were extracted
from the documents using Tree-tagger
1
. Stop words such
as: articles, prepositions and adverbs were removed. The
index terms of each document were statistically weighted
using their term frequency. Finally, the cosine measure was
used to compute the similarity between two documents [9].
4.1 Correctness evaluation
As it was mentioned before, the ﬁrst experiment was fo-
cused on assessing the correctness of the proposed algo-
rithms. With this aim, we will compare the clusterings built
by CUDA-OClus and CUDA-DClus with respect to those
built by the original OClustR and DClustR algorithms, un-
der the same conditions. For evaluating CUDA-OClus we
selected the Reu-10K, Reu-20K, Reu-30K, Reu-40K and
Reu-50K collections; whilst for evaluating CUDA-DClus
we selected Reu-10K, Reu-20K and Reu-30K collections.
These collections were selected due to they resemble the
collections over which both OClustR and DClustR were
evaluated in [6] and [7], respectively.
In order to evaluate CUDA-OClus, we executed OClustR
and CUDA-OClus over the Reu-10K, Reu-20K, Reu-30K,
Reu-40K and Reu-50K collections, using   = 0:25 and
0:35. We used these threshold values as these values obtai-
ned the best results in several collections as reported in the
original OClustR [6] and DClustR [7] articles. Then, we
took the clustering results obtained by OClustR as ground
truth and we evaluateds the clustering results obtained by
CUDA-OClus in terms of their accuracy, using the FBcu-
1
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 241
bed [14] and the Normalized Mutual Information (NMI)
[15] external evaluation measures.
FBcubed is one of the external evaluation measures most
used for evaluating overlapping clustering algorithms and
unlike of other external evaluation metrics, it meets with
four fundamental constrains proposed in [14] (cluster ho-
mogeneity, cluster completeness, rag bag and cluster size
vs quantity). On the other hand, NMI is a measure of si-
milarity borrowed from information theory, which has pro-
ved to be reliable [15]. Both metrics take values in [0;1],
where1 means identical results and0 completely different
results. In order to take into account the inherent data order
dependency of CUDA-OClus, we executed CUDA-OClus
twenty more times over the above mentioned collections,
for each parameter value, varying the order of their docu-
ments. Table 4 shows the average FBcubed and NMI va-
lues attained by CUDA-OClus for each selected collection,
using  = 0:25 and0:35.
FBCubed
Threshold Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K
  =0.25 0.999 0.999 1.000 0.998 1.000
  =0.35 0.999 1.000 1.000 1.000 0.999
NMI
Threshold Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K
  =0.25 0.997 0.999 1.000 0.999 1.000
  =0.35 0.998 1.000 1.000 1.000 0.999
Table 4: Average FBcubed and NMI values attained by
CUDA-OClus for each selected collection.
As it can be seen from Table 4, the average FBcubed
and NMI values attained by CUDA-OClus are very close
to 1, meaning that the clusters CUDA-OClus builds are al-
most identical to those built by OClustR. The differences
between the clusterings are caused by the inherent data or-
der dependency of the algorithms and also because of the
different ﬂoating point arithmetic used by CUDA.
In order to asses the validity of CUDA-DClus, in the se-
cond part of the ﬁrst experiment, we will compare the clus-
tering results built by CUDA-DClus with respect to those
obtained by DClustR. With this aim, we obtain a ground
truth by executing DClustR over the Reu-30K collection,
also using   = 0:25 and   = 0:35, and then, we pro-
cess Reu-20K and Reu-10K collections, in this order, as if
they were additions of documents to the collection. That
is, we are going to add the documents contained in Reu-
20K to the current collection (i.e., Reu-30K) and update
the clustering using DClustR and after that, we are goind
to add Reu-10K to the collection resulting from previous
additions (i.e., Reu-30K union Reu-20K) and update the
clustering again. We repeated the above mentioned exe-
cution under the same parameter conﬁguration but using
CUDA-DClus instead of DClustR and afterwards. Then,
we take the results obtained by DClustR as ground truth
and we evaluate each of the three clustering results obtai-
ned by CUDA-DClus in terms of their accuracy, using the
FBcubed and NMI external evaluation measures. Like in
the ﬁrst part of this experiment, we executed CUDA-DClus
twenty times under the above mentioned experimental con-
ﬁguration, each time varying the order of the documents
inside the collections. Table 5 shows the average FBcu-
bed and NMI values attained by CUDA-DClus for each se-
lected collection, using  = 0:25 and0:35.
FBCubed
Threshold Reu-30K Reu-30K+Reu-20K Reu-30K+Reu-
20K+Reu-10K
  =0:25 0.999 0.995 0.998
  =0:35 0.995 0.996 0.991
NMI
Threshold Reu-30K Reu-30K+Reu-20K Reu-30K+Reu-
20K+Reu-10K
  =0:25 0.998 0.994 0.999
  =0:35 0.997 0.998 0.995
Table 5: Average FBcubed and NMI values attained by
CUDA-DClus for each selected collection.
As it can be seen from Table 5, the average FBcubed
and NMI values attained by CUDA-DClus are very close
to 1, meaning that the clusters it builds are almost identical
to those built by DClustR. From this ﬁrst experiment, we
can conclude that the speed-up attained by CUDA-OClus
and CUDA-DClus does not degrade their accuracy wrt. the
original non parallel versions.
4.2 Execution time evaluation
In the second experiment, we evaluate the time impro-
vement achieved by CUDA-OClus and CUDA-DClus with
respect to OClusR and DClustR, respectively. With this
aim, we execute both OClustR and CUDA-OClus over
Reu-10K, Reu-20K, Reu-30K, Reu-40K, Reu-50K, Reu-
60K and Reu-70K, using  = 0:25 and 0:35 and we mea-
sured the time they spent. Like in the previous experiment,
in order to take into account the data order dependency of
both algorithms, we repeated the above mentioned execu-
tions twenty times, for each collection and each parame-
ter conﬁguration, but varying the order of the documents
of the collections. Figure 17 shows the average time both
OClustR and CUDA-OClus spent for clustering each se-
lected collection, for each parameter conﬁguration.
As it can be seen from Figure 17, CUDA-OClus is fas-
ter than OClustR over each selected dataset and for both
values of   ; for   = 0:25 and   = 0:35, CUDA-OClus
is respectively 1.26x and 1.29x faster than OClustR. It is
important to note from Figure 17 that as the size of the pro-
cessed collection grows, the difference in the time spent
for each algorithm also grows; this behavior shows how
well CUDA-OClus scale when the size of the collection
grows. We would like to highlight the fact that the speci-
ﬁcations of the computer used in the experiments provided
advantage to CPU-based algorithms over GPU-based algo-
rithms, since a Core i7-4770 processor at 3.40 GHz with
8GB RAM is superior to a PCI express NVIDA GeForce
GT 635, with 1 GB DRAM, which only has two streaming
processors and a limited memory. Hence, taking into ac-
count the execution model of a GPU, in which the grid
blocks are numerated and they distributed among all stre-
aming multiprocessors, which execute simultaneously one
242 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
0
257
514
771
1028
Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K Reu-60K Reu-70K
Time (sec.)
Collec!ons
OClustR
CUDA-Clus
(a)   = 0:25
0
215
430
645
860
Reu-10K Reu-20K Reu-30K Reu-40K Reu-50K Reu-60K Reu-70K
Time (sec.)
Collec!ons
OClustR
CUDA-Clus
(b)   = 0:35
Figure 17: Time spent by OClustR and CUDA-OClus for
clustering the selected experimental datasets, using   =
0:25 and0:35.
task over a speciﬁc block, then we expect that if we use
a powerful GPU with more streaming multiprocessor, the
difference between the processing time achieved by paral-
lel version and sequential version will be higher than the
one showed in this experiments.
In order to compare both DClustR and CUDA-DClus,
in the second part of the second experiment, we clustered
the Reu-50K collection using both algorithms and then, we
measured the time each algorithm spent for updating the
current clustering each timeN documents of Tdt-65K col-
lection are incrementally added to the existing collection.
In this experiment we also used   = 0:25 and 0:35, and
we set N = 5000 and N = 1000 which are much grea-
ter values than those used to evaluate DClustR [7]. In or-
der to take into account the data order dependency of both
algorithms, the above mentioned executions were also re-
peated twenty times, for each collection and each parame-
ter conﬁguration, but varying the order of the documents
of the collections. Figure 18 shows the average time both
DClustR and CUDA-DClus spent for updating the current
clustering, for each parameter conﬁguration.
As it can be seen from Figure 18, CUDA-DClus has a
better behavior than DClustR, for each parameter conﬁgu-
0
200
400
600
5 10 15 20 25
Time (sec.)
Objects (1*10
3
)
DClustR
CUDA-DClus
(a)   = 0:25,N = 5000
0
150
300
450
600
5 10 15 20 25
Time (sec.)
Objects (1*10
3
)
DClustR
CUDA-DClus
(b)   = 0:35,N = 5000
0
350
700
1050
1400
10 20 30 40 50
Time (sec.)
Objects (1*10
3
)
DClustR
CUDA-DClus
(c)   = 0:25,N = 10000
0
350
700
1050
1400
10 20 30 40 50
Time (sec.)
Objects (1*10
3
)
DClustR
CUDA-DClus
(d)   = 0:35,N = 10000
Figure 18: Time spent by DClustR and CUDA-DClus for
updating the current clustering, using  = 0:25 and 0:35,
forN = 5000 and10000.
ration, when multiple additions are processed over the se-
lected dataset, showing an average speed up of 1.25x and
1.29x for  = 0:25;N = 5000 and  = 0:35;N = 5000
respectively. Moreover, it also showed an average speed
up of 1.19x and 1.26x for   = 0:25;N = 10000 and
Static and Incremental Overlapping Clustering Algorithms for. . . Informatica 42 (2018) 229–244 243
  = 0:35;N = 10000 respectively. As in the ﬁrst part of
this experiment, it can be seen also from Figure 18, that the
behavior of CUDA-DClus, with respect to that of DClustR,
becomes better as the size of the collection grows; in this
way, we can say that CUDA-DClus also scales well as the
size of the collection grows.
4.3 Memory use evaluation
Although the spatial complexity of both algorithm is
O(jVj+
      e
E
        ), the strategy CUDA-DClus proposes for re-
presenting
e
G
  should allow to reduce the amount of me-
mory needed to update the clustering each time the col-
lection changes. Thus, in the third experiment, we compare
the amount of memory used by CUDA-DClus against that
used by DClustR, when processing the changes performed
in the second experiment. The amount of connected com-
ponent loaded by both algorithms when they are updating
the current clustering after changes, is directly proportio-
nal to the memory used. Based on this, Figure 19 shows the
average number of connected components (i.e., Ave. NCC)
each algorithm load into system memory, when processing
the changes presented in Figure 18, for each parameter con-
ﬁguration.
As it can be seen from Figure 19, CUDA-DClus consu-
mes less memory than DClustR, for each parameter con-
ﬁguration, thereby hence resulting the memory usage of
CUDA-DClus is respectively 22:43% for   = 0:25 and
42:46% for   = 0:35 less than the one of DClustR. The
above mentioned characteristic, plus the fact that CUDA-
DClus is also faster than DClustR, makes CUDA-DClus
suitable for applications processing large document col-
lections.
We would like to highlight that in the worst scenario,
if the clustering of all the connected components needs to
be updated, all the partial connected components will be
loaded to system memory and thus, our proposed CUDA-
DClus and DClustR will have a similar behavior. Addi-
tionally, taking into account the results of experiments in
sections 4.1 and 4.2, we can conclude that the strategy pro-
posed for reducing the memory used by CUDA-DClus does
not include any considerable cost in the overall processing
time of CUDA-DClus or in its accuracy.
5 Conclusions
In this paper, we introduced two GPU-based parallel versi-
ons of the OClustR and DClustR clustering algorithms, na-
mely CUDA-OClus and CUDA-DClus, speciﬁcally tailo-
red for document clustering. CUDA-OClus proposes a stra-
tegy in order to speed up the most time consuming steps of
OClustR. This strategy is reused by CUDA-DClus in or-
der to speed up the most time consuming steps of DClustR.
Moreover, CUDA-DClus proposes a new strategy for re-
presenting the graph
e
G
  that DClustR uses for represen-
ting the collection of documents. This new representation
60
100
140
180
5 10 15 20 25
Ave. NCC
Objects (1*10
3
)
DClustR
CUDA-DClus
(a)   = 0:25,N = 5000
500
900
1300
1700
2100
2500
5 10 15 20 25
Ave. NCC
Objects (1*10
3
)
DClustR
CUDA-DClus
(b)   = 0:35,N = 5000
100
140
180
220
10 20 30 40 50
Ave. NCC
Objects (1*10
3
)
DClustR
CUDA-DClus
(c)   = 0:25,N = 10000
500
1500
2500
3500
10 20 30 40 50
Ave. NCC
Objects (1*10
3
)
DClustR
CUDA-DClus
(d)   = 0:35,N = 10000
Figure 19: Average number of connected components
DClustR and CUDA-DClus load into system memory
when they are updating the current clustering, using   =
0:25 and0:35, forN = 5000 and10000.
allows CUDA-DClus to reduce the amount of memory it
needs to use and also it helps CUDA-DClus to avoid re-
building the affected components each time the collection
changes but still keep them updated after each changes,
244 Informatica 42 (2018) 229–244 L.J. González-Soler et al.
with no extra cost.
The proposed parallel algorithms were compared against
their original versions, over several standard document
collections. The experiments were focused on: (a) as-
sess the correctness of the proposed parallel algorithms,
(b) evaluate the speed-up achieved by CUDA-OClus and
CUDA-DClus with respect to OClustR and DClustR, re-
spectively, and (c) evaluate the memory both CUDA-DClus
and DClusR consumes when they are processing changes.
From the experiments, it can be seen that both CUDA-
OClus and CUDA-DClus are faster than OClustR and
DClustR, respectively, and that the speed up these parallel
versions attain do not degrade their accuracy. The experi-
ments also showed that CUDA-DClus consumes less me-
mory than DClustR, when both algorithms are processing
changes over the same collection.
Based on the obtained results, we can conclude that both
CUDA-OClus and CUDA-DClus enhance the efﬁciency of
OClustR and DClustR, respectively, in problems dealing
with a very large number of documents. These parallel al-
gorithms could be useful in applications, like for instance
news analysis, information organization and proﬁles iden-
tiﬁcation, among others. We would like to mention, that
even when the proposed parallel algorithms were speciﬁ-
cally tailored for processing documents with the purpose
of using the cosine measure, the strategy they propose can
be easily extended to work with other similarity or distance
measures like, for instance, euclidean and manhattan dis-
tances.
As future work, we are going to explore the use in
CUDA-OClus and CUDA-DClus of other types of memo-
ries in GPU such as texture memory, which is a faster me-
mory than the one both CUDA-OClus and CUDA-DClus
are using now. Besides, we are going to evaluate both al-
gorithms over more faster GPU cards, in order to have a
better insight of the performance of both algorithms when
the number of CUDA cores are increased.
References
[1] E. Bae, J. Bailey, G. Dong, A clustering comparison
measure using density proﬁles and its application to
the discovery of alternate clusterings, Data Mining
and Knowledge Discovery 21 (3) (2010) 427–471.
[2] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering:
a review, ACM computing surveys (CSUR) 31 (3)
(1999) 264–323.
[3] S. Gregory, A fast algorithm to ﬁnd overlapping com-
munities in networks, in: Machine learning and kno-
wledge discovery in databases, Springer, 2008, pp.
408–423.
[4] J. Aslam, K. Pelekhov, D. Rus, Static and dynamic in-
formation organization with star clusters, in: Procee-
dings of the seventh international conference on In-
formation and knowledge management, ACM, 1998,
pp. 208–217.
[5] A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz-
Shulcloper, J. M. Pérez-Martínez, Jerartop: A new
topic detection system, in: Progress in Pattern Re-
cognition, Image Analysis and Applications, Sprin-
ger, 2004, pp. 446–453.
[6] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A.
Carrasco-Ochoa, J. E. Medina-Pagola, Oclustr: A
new graph-based algorithm for overlapping cluste-
ring, Neurocomputing 121 (2013) 234–247.
[7] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A.
Carrasco-Ochoa, J. E. Medina-Pagola, An algo-
rithm based on density and compactness for dynamic
overlapping clustering, Pattern Recognition 46 (11)
(2013) 3040–3055.
[8] L. J. G. Soler, A. P. Suárez, L. Chang, Efﬁcient over-
lapping document clustering using gpus and multi-
core systems, in: Progress in Pattern Recognition,
Image Analysis, Computer Vision, and Applications
- 19th Iberoamerican Congress, CIARP 2014, Puerto
Vallarta, Mexico, November 2-5, 2014. Proceedings,
2014, pp. 264–271.
[9] M. W. Berry, M. Castellanos, Survey of text mining,
Computing Reviews 45 (9) (2004) 548.
[10] R. Gil-García, A. Pons-Porrata, Dynamic hierarchical
algorithms for document clustering, Pattern Recogni-
tion Letters 31 (6) (2010) 469–477.
[11] J. Sanders, E. Kandrot, CUDA by example: an
introduction to general-purpose GPU programming,
Addison-Wesley Professional, 2010.
[12] D. B. Kirk, W. H. Wen-mei, Programming massively
parallel processors: a hands-on approach, Newnes,
2012.
[13] G. Salton, A. Wong, C.-S. Yang, A vector space mo-
del for automatic indexing, Communications of the
ACM 18 (11) (1975) 613–620.
[14] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A com-
parison of extrinsic clustering evaluation metrics ba-
sed on formal constraints, Information retrieval 12 (4)
(2009) 461–486.
[15] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting
the overlapping and hierarchical community structure
in complex networks, New Journal of Physics 11 (3)
(2009) 033015.