Informatica 40 (2016) 169–180 169 A Multi-Criteria Document Clustering Method Based on Topic Modeling and Pseudoclosure Function Quang Vu Bui CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University, 75014, Paris, France Hue University of Sciences, Vietnam E-mail: quang-vu.bui@etu.ephe.fr Karim Sayadi CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University, 75014, Paris, France Sorbonne University, UPMC Univ Paris 06, France E-mail: karim.sayadi@upmc.fr Marc Bui CHArt Laboratory EA 4004, Ecole Pratique des Hautes Etudes, PSL Research University, 75014, Paris, France E-mail: marc.bui@ephe.sorbonne.fr Keywords: pretopology, pseudoclosure, clustering, k-means, latent Dirichlet allocation, topic modeling, Gibbs sampling Received: July 1, 2016 We address in this work the problem of document clustering. Our contribution proposes a novel unsuper- vised clustering method based on the structural analysis of the latent semantic space. Each document in the space is a vector of probabilities that represents a distribution of topics. The document membership to a cluster is computed taking into account two criteria: the major topic in the document (qualitative criterion) and the distance measure between the vectors of probabilities (quantitative criterion). We perform a struc- tural analysis on the latent semantic space using the Pretopology theory that allows us to investigate the role of the number of clusters and the chosen centroids, in the similarity between the computed clusters. We have applied our method to Twitter data and showed the accuracy of our results compared to a random choice number of clusters. Povzetek: Predstavljena metoda grupira dokumente glede na semantični prostor. Eksperimenti so narejeni na podatkih s Twitterja. 1 Introduction Classifying a set of documents is a standard problem ad- dressed in machine learning and statistical natural language processing [13]. Text-based classification (also known as text categorization) examines the computer-readable ASCII text and investigates linguistic properties to cate- gorize the text. When considered as a machine learning problem, it is also called statistical Natural Language Pro- cessing (NLP) [13]. In this task, a text is assigned to one or more predefined class labels (i.e category) through a specific process in which a classifier is built, trained on a set of features and then applied to label future incoming texts. Given the labels, the task is performed within the supervised learning framework. Several Machine Learn- ing algorithms have been applied to text classification (see [1] for a survey): Rocchio’s Algorithm, N-Nearest Neigh- bors, Naive Bayes, Decision tree, Support Vector Machine (SVM). Text-based features are typically extracted from the so- called word space model that uses distributional statistics to generate high-dimensional vector spaces. Each document is represented as a vector of word occurrences. The set of documents is represented by a high-dimensional sparse matrix. In the absence of predefined labels, the task is re- ferred as a clustering task and is performed within the un- supervised learning framework. Given a set of keywords, one can use the angle between the vectors as a measure of similarity between the documents. Depending on the al- gorithm, different measures are used. Nevertheless, this approach suffers from the curse of dimensionality because of the sparse matrix that represents the textual data. 
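For illustration, the following minimal sketch (assuming the scikit-learn toolkit, which is only one possible choice and not necessarily the tooling used here) shows this word-space representation: each document becomes a row of a sparse term-count matrix, and pairwise cosine similarities can then be read off that matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the terrorist attack in paris shocked the world",
    "paris attack suspects arrested by french police",
    "new smartphone unveiled at the consumer electronics show",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # sparse document-term matrix
print(X.shape)                           # (3, vocabulary size)
print(cosine_similarity(X).round(2))     # 3 x 3 document similarity matrix
```

Even in this toy corpus most entries of X are zero; with a realistic vocabulary the matrix becomes extremely sparse and high-dimensional, which is precisely the issue addressed below.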
One of the possible solutions is to represent the text as a set of top- ics and use the topics as an input for a clustering algorithm. To group the documents based on their semantic content, the topics need to be extracted. This can be done using one of the following three methods. (i) LSA [10] (Latent Se- mantic Analysis) uses the Singular Value Decomposition methods to decompose high-dimensional sparse matrix to three matrices: one matrix that relates words to topics, an- other one that relates topics to documents and a diagonal 170 Informatica 40 (2016) 169–180 Q.V. Bui et al. matrix of singular value. (ii) Probabilistic LSA [8] is a probabilistic model that treats the data as a set of obser- vations coming from a generative model. The generative model includes hidden variables representing the probabil- ity distribution of the words in the topics and the proba- bility distribution of the topics in the words. (iii) Latent Dirichlet Allocation [4] (LDA) is a Bayesian extension of probabilistic LSA. It defines a complete generative model with a uniform prior and full Bayesian estimator. LDA gives us three latent variables after computing the posterior distribution of the model; the topic assignment, the distribution of words in each topic and the distribution of the topics in each document. Having the distribution of topics in documents, we can use it as the input for cluster- ing algorithms such as k-means, hierarchical clustering. K-means uses a distance measure to group a set of data points within a predefined random number of clus- ters. Thus, to perform a fine-grained analysis of the clus- tering process we need to control the number of clusters and the distance measure. The Pretopology theory [3] of- fers a framework to work with categorical data, to estab- lish a multi-criteria distance for measuring the similarity between the documents and to build a process to structure the space [11] and infer the number of clusters for k-means. We can then tackle the problem of clustering a set of doc- uments by defining a family of binary relationships on the topic-based contents of the documents. The documents are not only grouped together using a measure of similarity but also using the pseudoclosure function built from a family of binary relationships between the different hidden semantic contents (i.e topics). The idea of using Pretopology theory for k-means clus- tering has been proposed by [16]. In this paper, the au- thors proposed the method to find automatically a num- ber k of clusters and k centroids for k-means clustering by results from structural analysis of minimal closed sub- sets algorithm [11] and also proposed to use pseudoclosure distance constructed from the relationships family to exam- ine the similarity measure for both numeric and categorical data. The authors illustrated the method with a toy exam- ple about the toxic diffusion between 16 geographical areas using only one relationship. For the problem of classification, the authors of this work [2] built a vector space with Latent Semantic Anal- ysis (LSA) and used the pseudoclosure function from Pre- topological theory to compare all the cosine values between the studied documents represented by vectors and the doc- uments in the labeled categories. A document is added to a labeled categories if it has a maximum cosine value. 
Our work differs from the work of [2] and extends the method proposed in [16] with two directions: first, we exploited this idea in document clustering and integrated structural information from LDA using the pretopological concepts of pseudoclosure and minimal closed subsets in- troduced in [11]. Second, we showed that Pretopology the- ory can apply to multi-criteria clustering by defining the pseudo distance built from multi-relationships. In our pa- per, we clustered documents by using two criteria: one based on the major topic of document (qualitative crite- rion) and the other based on Hellinger distance (quantita- tive criterion). The clustering is based on these two crite- ria but not on multicriteria optimization [5] for clustering algorithms.Our application on Twitter data also proposed a method to construct a network from the multi-relations network by choosing the set of relations and then applying strong or weak Pretopology. We present our approach in a method that we named the Method of Clustering Documents using Pretopology and Topic Modeling (MCPTM). MCPTM organizes a set of un- structured entities in a number of clusters based on multi- ple relationships between each two entities. Our method discovers the topics expressed by the documents, tracks changes step by step over time, expresses similarity based on multiple criteria and provides both quantitative and qualitative measures for the analysis of the document. 1.1 Contributions The contributions of this paper are as follows. 1. We propose a new method to cluster text documents using Pretopology and Topic Modeling. 2. We investigate the role of the number of clusters in- ferred by our analysis of the documents and the role of the centroids in the similarity between the computed clusters. 3. We conducted experiments with different distances measures and show that the distance measure that we introduced is competitive. 1.2 Outline The article is organized as follows: Section 2, 3 present some basic concepts such as Latent Dirichlet Allocation (Section 2) and the Pretopology theory (Section 3), Sec- tion 4 explains our approach by describing at a high level the different parts of our algorithm. In Section 5, we ap- ply our algorithm to a corpus consisting of microblogging posts from Twitter.com. We conclude our work in Section 6 by presenting the obtained results. 2 Topic modeling Topic Modeling is a method for analyzing large quantities of unlabeled data. For our purposes, a topic is a proba- bility distribution over a collection of words and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics [4, 8, 6, 15]. In many cases, there exists a semantic relation- ship between terms that have high probability within the A Multi-Criteria Document Clustering Method Based on. . . Informatica 40 (2016) 169–180 171 same topic – a phenomenon that is rooted in the word co- occurrence patterns in the text and that can be used for in- formation retrieval and knowledge discovery in databases. 2.1 Latent Dirichlet allocation wd,n z θd φk α β word n document d topic k Figure 1: Bayesian Network (BN) of Latent Dirichlet Al- location. Latent Dirichlet Allocation (LDA) by Blei et al. [4] is a generative probabilistic model for collections of grouped discrete data. Each group is described as a random mixture over a set of latent topics where each topic is a discrete dis- tribution over the vocabulary collection. 
LDA is applicable to any corpus of grouped discrete data. In our work we refer to the standard Natural Language Processing (NLP) use case where a corpus is a collection of documents and the discrete data are represented by the occurrence of words. LDA is a probabilistic model for unsupervised learning; it can be seen as a Bayesian extension of probabilistic Latent Semantic Analysis (pLSA) [8]. More precisely, LDA defines a complete generative model which is a full Bayesian estimator with a uniform prior, while pLSA provides a Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimator. For more technical details we refer to the work of Gregor Heinrich [7]. The generative model of LDA is described with the probabilistic graphical model [9] in Fig. 1.

In this LDA model, different documents d have different topic proportions θd. At each position in the document, a topic z is selected from the topic proportion θd. Finally, a word is picked from the vocabulary according to the word probabilities φk of that topic z. θd and φk are two Dirichlet distributions with α and β as hyperparameters. We assume symmetric Dirichlet priors, with α and β each having a single value. The hyperparameters specify the nature of the priors on θd and φk. The hyperparameter α can be interpreted as a prior observation count of the number of times a topic z is sampled in document d [15]. The hyperparameter β can be interpreted as a prior observation count of the number of times words w are sampled from a topic z [15].

The advantage of the LDA model is that interpreting documents at the topic level instead of the word level allows us to gain more insight into their meaningful structure, since noise can be suppressed by the clustering of words into topics. Consequently, we can use the topic proportions to organize, search, and classify a collection of documents more effectively.

2.2 Inference with Gibbs sampling

In this subsection, we specify a topic model procedure based on the Latent Dirichlet Allocation (LDA) and Gibbs Sampling.

The key problem in Topic Modeling is posterior inference. This refers to reversing the defined generative process and learning the posterior distributions of the latent variables in the model given the observed data. In LDA, this amounts to solving the following equation:

p(\theta, \phi, z \mid w, \alpha, \beta) = \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}   (1)

Unfortunately, this distribution is intractable to compute [7]. The normalization factor in particular, p(w | α, β), cannot be computed exactly. However, there are a number of approximate inference techniques that we can apply to the problem, including variational inference (as used in the original LDA paper [4]) and Gibbs Sampling, which we shall use.

For LDA, we are interested in the proportions of the topics in a document, represented by the latent variable θd, the topic-word distributions φ(z), and the topic index assignments for each word zi. While conditional distributions - and therefore an LDA Gibbs Sampling algorithm - can be derived for each of these latent variables, we note that both θd and φ(z) can be calculated using just the topic index assignments zi (i.e. z is a sufficient statistic for both these distributions). Therefore, a simpler algorithm can be used if we integrate out the multinomial parameters and simply sample zi. This is called a collapsed Gibbs sampler [6, 15].
The collapsed Gibbs sampler for LDA needs to compute the probability of a topic z being assigned to a word wi, given all other topic assignments to all other words. Somewhat more formally, we are interested in computing the following posterior up to a constant:

p(z_i \mid z_{-i}, \alpha, \beta, w)   (2)

where z−i means all topic allocations except for zi. Equation 3 shows how to compute the posterior distribution for topic assignment.

P(z_i = j \mid z_{-i}, w) \propto \frac{n^{w_i}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta} \cdot \frac{n^{d_i}_{-i,j} + \alpha}{n^{d_i}_{-i,\cdot} + K\alpha}   (3)

where n^{w_i}_{-i,j} is the number of times word wi was related to topic j, n^{(\cdot)}_{-i,j} is the number of times all other words were related with topic j, n^{d_i}_{-i,j} is the number of times topic j was related with document di, and n^{d_i}_{-i,\cdot} is the number of times all other topics were related with document di. These notations are taken from the work of Thomas Griffiths and Mark Steyvers [6].

Algorithm 1 The LDA Gibbs sampling algorithm.
Require: words w ∈ corpus D = (d1, d2, . . . , dM)
 1: procedure LDA-GIBBS(w, α, β, T)
 2:   randomly initialize z and increment counters
 3:   loop for each iteration
 4:     loop for each word w in corpus D
 5:       Begin
 6:         word ← w[i]
 7:         tp ← z[i]
 8:         n_{d,tp} −= 1; n_{word,tp} −= 1; n_{tp} −= 1
 9:         loop for each topic j ∈ {0, . . . , K − 1}
10:           compute P(zi = j | z−i, w)
11:         tp ← sample from p(z | ·)
12:         z[i] ← tp
13:         n_{d,tp} += 1; n_{word,tp} += 1; n_{tp} += 1
14:       End
15:   Compute φ(z)
16:   Compute θd
17:   return z, φ(z), θD        ▷ Output
18: end procedure

\hat{\phi}^{(w)}_{j} = \frac{n^{(w)}_{j} + \beta}{n^{(\cdot)}_{j} + V\beta}   (4)

\hat{\theta}^{(d)}_{j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)}_{\cdot} + K\alpha}   (5)

Equation 4 is the Bayesian estimation of the distribution of the words in a topic. Equation 5 is the estimation of the distribution of topics in a document.

3 Pretopology theory

Pretopology is a mathematical modeling tool for the concept of proximity. It was first developed in the field of social sciences for analyzing discrete spaces [3]. Pretopology establishes powerful tools for conceiving a process to structure the space and, for example, infer the number of clusters. This is made possible by ensuring a follow-up of the processes of dilation, alliance, adherence, closed subsets and acceptability [16, 12].

3.1 Pseudoclosure

Let us consider a nonempty set E and P(E), which designates all the subsets of E.

Definition 1. A pseudoclosure a(.) is a mapping from P(E) to P(E) which satisfies the following two conditions:

a(\emptyset) = \emptyset; \quad \forall A \subset E, \; A \subset a(A)   (6)

A pretopological space (E, a) is a set E endowed with a pseudoclosure function a(.). The subset a(A) is called the pseudoclosure of A.

As a(a(A)) is not necessarily equal to a(A), a sequential application of the pseudoclosure to A can be used to model expansions:

A \subset a(A) \subset a(a(A)) = a^2(A) \subset \ldots \subset a^k(A)

Figure 2: Iterated application of the pseudoclosure map leading to the closure.

Definition 2. Let (E, a) be a pretopological space and A ⊂ E. A is a closed subset if and only if a(A) = A.

Definition 3. Given a pretopological space (E, a), we call the closure of A, when it exists, the smallest closed subset of (E, a) which contains A. The closure of A is denoted by F(A).

Remark:
– F(A) is the intersection of all closed subsets which contain A. In the case where (E, a) is a "general" pretopological space, the closure may not exist.
– The closure is very important because of the information it gives about the "influence" or "reachability" of a set, meaning, for example, that a set A can influence or reach elements in F(A), but not further (see Figure 2).
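To make Figure 2 concrete, the following minimal Python sketch (an illustration written for this presentation, not code from our implementation) repeatedly applies a pseudoclosure built from a single reflexive relation, anticipating the relation-based construction of Section 3.2, until a fixed point a(F) = F is reached.

```python
def pseudoclosure(A, neighborhoods):
    """a(A): A plus every element x whose neighborhood V(x) intersects A.
    `neighborhoods` maps each x in E to V(x) = {y | x R y} for one reflexive relation R."""
    A = set(A)
    return A | {x for x, Vx in neighborhoods.items() if Vx & A}

def closure(A, neighborhoods):
    """F(A): iterate a(.) until a fixed point a(F) = F is reached."""
    current = set(A)
    while True:
        expanded = pseudoclosure(current, neighborhoods)
        if expanded == current:      # closed subset reached
            return current
        current = expanded

# Toy space E = {0, 1, 2, 3, 4} with a reflexive proximity relation.
V = {0: {0, 1}, 1: {1, 2}, 2: {2}, 3: {3, 4}, 4: {4}}
print(closure({2}, V))   # {0, 1, 2}: elements that reach 2 step by step
print(closure({4}, V))   # {3, 4}
```

On a finite set the iteration always terminates, and in the V-type spaces introduced next the resulting fixed point is exactly the closure F(A).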
Hence, it is necessary to build a pretopological spaces in which the closure always exists. V -type pretopological spaces are the most interesting cases. Definition 4. A Pretopology space (E, a) is called V-type space if and only if ∀A ⊂ E,∀B ⊂ E, (A ⊂ B)⇒ (a(A) ⊂ a(B)) (7) Proposition 1. In any pretopological space of type V , given a subset A of E, the closure of A always exists. The other reason why we use the spaces of type V is that we can build them from a family of reflexive binary relations on the finite set E. That thus makes it possible to take various points of view (various relations) expressed in a qualitative way to determine the pretopological struc- ture placed on E. So, it can be applied on multi-criteria clustering or multi-relations networks. 3.2 Pretopology and binary relationships Suppose we have a family (Ri)i=1,...,n of binary reflex- ive relationships on a finite set E. Let us consider ∀i = 1, 2, . . . , n, ∀x ∈ E, Vi(x) defined by: Vi(x) = {y ∈ E|xRi y} (8) Then, the pseudoclosure as(.) is defined by: as(A) = {x ∈ E|∀i = 1, 2, . . . , n, Vi(x) ∩A 6= ∅} (9) A Multi-Criteria Document Clustering Method Based on. . . Informatica 40 (2016) 169–180 173 Pretopology defined on E by as(.) using the intersection operator is called the strong Pretopology induced by the family (Ri)i=1,...,n. Similarly, we can define weak Pretopology from aw(.) by using the union operator: aw(A) = {x ∈ E|∃i = 1, 2, . . . , n, Vi(x) ∩A 6= ∅} (10) Proposition 2. as(.) and aw(.) determine on E a pretopo- logical structure and the spaces (E, as), (E, aw) are of V - type. 3.3 Minimal closed subsets We denote Fe as the family of elementary closed subsets, the set of closures of each singleton {x} of P (E). So in a V -type pretopological space, we get: - ∀x ∈ E,∃Fx : closure of {x}. - Fe = {Fx|x ∈ E} Definition 5. Fmin is called a minimal closed subset if and only if Fmin is a minimal element for inclusion in Fe. We denote Fm = {Fmj , j = 1, 2, . . . , k}, the family of minimal closed subsets, the set of minimal closed subsets in Fe. 4 Our approach In our approach, we build The Method of Clustering Doc- uments using Pretopology and Topic Modeling (MCPTM) which clusters documents via Topic Modeling and pseudo- closure. MCPTM can be built by: 1. Defining the topic-distribution of each document di in corpus D by document structure analysis using LDA. 2. Defining two binary relationships: RMTP based on major topic and RdH based on Hellinger distance. 3. Building the pseudoclosure function from two binary relationships RMTP , RdH . 4. Building the pseudoclosure distance from pseudoclo- sure function. 5. Determining initial parameters for the k-means algo- rithm from results of minimal closed subsets. 6. Using the k-means algorithm to cluster sets of docu- ments with initial parameters from the result of min- imal closed subsets, the pseudoclosure distance to compute the distance between two objects and the inter-pseudoclosure distance to re-compute the new centroids. 4.1 Document structure analysis by LDA A term-document matrix is given as an input to LDA and it outputs two matrices: – The document-topic distribution matrix θ. – The topic-term distribution matrix φ. The topic-term distribution matrix φ ∈ RK×V consists of K rows, where the i-th row φi ∈ RV is the word distribu- tion of topic i. The terms with high φij values indicate that they are the representative terms of topic i. 
Therefore, by looking at such terms one can grasp the meaning of each topic without looking at the individual documents in the cluster.

In a similar way, the document-topic distribution matrix θ ∈ R^{M×K} consists of M rows, where the i-th row θi ∈ R^K is the topic distribution for document i. A high probability value of θij indicates that document i is closely related to topic j. In addition, documents with low θij values over all the topics are noisy documents that belong to none of the topics. Therefore, by looking at the θij values, one can understand how closely a document is related to a topic.

4.2 Defining binary relationships

By using LDA, each document may be characterized by its topic distribution and also be labeled by the topic with the highest probability. In this subsection, we use this information to define relations between two documents based on the way we consider the "similarity" between them.

4.2.1 Based on major topic

Firstly, based on the label information, we can consider connecting two documents if they have the same label. However, in some cases, such as noisy documents, the probability of the label topic is very small and it is not appropriate to use this label to represent the document. Hence, we only use the label information if its probability is higher than a threshold p0. We define the major topic of each document as:

Definition 6. MTP(di) is the major topic of document di if MTP(di) is the topic with the highest probability in the topic distribution of document di and this probability is greater than a threshold p0, with p0 ≥ 1/K, K the number of topics:

MTP(d_i) = \{k \mid \theta_{ik} = \max_j \theta_{ij} \text{ and } \theta_{ik} \geq p_0\}.

Considering two documents dm, dn with their major topics MTP(dm), MTP(dn), we see that document dm is close to document dn if they have the same major topic. So, we propose a definition of the binary relationship RMTP between two documents based on their major topics:

Definition 7. Document dm has the binary relationship RMTP with document dn if dm and dn have the same major topic.

4.2.2 Based on Hellinger distance

Secondly, we can use the topic distributions of documents to define a relation based on the similarity between two real-valued vectors or two probability distributions. If we consider a probability distribution as a vector, we can choose distances or similarity measures related to vector distance such as the Euclidean distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient, etc. But it is better to choose distances or similarity measures related to probability distributions such as the Kullback-Leibler Divergence, Bhattacharyya distance, Hellinger distance, etc. We choose the Hellinger distance because it is a metric for measuring the deviation between two probability distributions, it is easy to compute, and it is bounded in [0, 1].

Definition 8. For two discrete probability distributions P = (p1, . . . , pk) and Q = (q1, . . . , qk), their Hellinger distance is defined as

d_H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} (\sqrt{p_i} - \sqrt{q_i})^2}   (11)

The Hellinger distance is directly related to the Euclidean norm of the difference of the square root vectors, i.e.

d_H(P, Q) = \frac{1}{\sqrt{2}} \left\| \sqrt{P} - \sqrt{Q} \right\|_2.

The Hellinger distance satisfies the inequality 0 ≤ dH ≤ 1. This distance is a metric for measuring the deviation between two probability distributions. The distance is 0 when P = Q. Disjoint P and Q show the maximum distance of 1.
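For illustration (a sketch written for this presentation, independent of our implementation), the Hellinger distance between two rows of the document-topic matrix θ follows directly from Definition 8:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (Eq. 11)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

# Two toy topic distributions over K = 4 topics.
theta_m = [0.70, 0.10, 0.10, 0.10]
theta_n = [0.65, 0.15, 0.10, 0.10]
print(hellinger(theta_m, theta_n))            # small value: similar documents
print(hellinger([1, 0, 0, 0], [0, 1, 0, 0]))  # 1.0: disjoint distributions
```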
The lower the value of the Hellinger distance, the smaller the deviation between two probability distributions. So, we can use the Hellinger distance to measure the similarity be- tween two documents dm,dn. We then define the binary relationship RdH between two documents as: Definition 9. Document dm has binary relationship RdH with document dn if dH(dm, dn) ≤ d0, 0 ≤ d0 ≤ 1, d0 is the accepted threshold. 4.3 Building pseudoclosure function Based on two binary relationshipsRMTP andRdH , we can build the neighborhood basis (see. Algorithm 2) and then build the pseudoclosures (see Algorithm 3) for strong (with intersection operator) and weak (with union operator) Pre- topology. 4.4 Building pseudoclosure distance In standard k-means, the centroid of a cluster is the average point in the multidimensional space. Its coordinates are the arithmetic mean for each dimension separately over all the Algorithm 2 Neighborhood Basis Using Topic Modeling. Require: document-topic distribution matrix θ, corpusD Require: RMTP , RdH : family of relations. 1: procedure NEIGHBORHOOD-TM(D, θ, RMTP , RdH ) 2: loop for each relationRi ∈ {RMTP , RdH } 3: loop for each document dm ∈ D 4: loop for each document dn ∈ D 5: IfRi(dm, dn) then 6: Bi[dm].append(dn) 7: returnB = [B1, B2] . Output 8: end procedure Algorithm 3 Pseudoclosure using Topic Modeling. Require: B = (B1, B2),D = {d1, . . . , dM} 1: procedure PSEUDOCLOSURE(A,B,D) 2: aA=A 3: loop for each document dn ∈ D 4: If (A ∩ B1[dn] 6= ∅ or A ∩ B2[dn] 6= ∅) then 5: aA.append(dn) 6: return aA . Ouput 7: end procedure points in the cluster which are not effective with categor- ical data analysis. On the other hand, the pseudoclosure distance is used to examine the similarity using both nu- meric and categorical data. Therefore, it can contribute to improving the classification with k-means. Definition 10. We define δ(A,B) pseudoclosure distance between two subsets A and B of a finite set E: k0 = min(min{k|A ⊂ ak(B)},∞) k1 = min(min{k|B ⊂ ak(A)},∞) δ(A,B) = min(k0, k1) where ak(.) = ak−1(a(.)) Definition 11. We call DA(x) interior-pseudo-distance of a point x in a set A: DA(x) = 1 |A| ∑ y∈A δ(x, y). In case where A and B are reduced to one element x and y, we get the distance δ(x, y). For clustering docu- ments with k-means algorithm, we use the pseudoclosure distance δ(x, y) to compute distance between two docu- ments (each document represented by its topic distribution is a point x ∈ E) and the interior-pseudo-distance DA(x) to compute centroid of A (x0 is chosen as centroid of A if DA(x0) = minx∈ADA(x)). 4.5 Structure analysis with minimal closed subsets The two limits of the standard k-means algorithm are the number of clusters which must be predetermined and the randomness in the choice of the initial centroids of the clus- ters. Pretopology theory gives a good solution to omit these limits by using the result from minimal closed subsets. The algorithm to compute minimal closed subset is presented in algorithm 4. A Multi-Criteria Document Clustering Method Based on. . . Informatica 40 (2016) 169–180 175 Algorithm 4 Minimal closed subsets algorithm. 
Require: corpusD, pseudoclosure aA() 1: procedure MINIMAL-CLOSED-SUBSETS(D, aA()) 2: compute family of elementary closed subsets Fe 3: Fm = ∅ 4: loop until Fe = ∅ 5: Begin 6: Choose F ⊂ Fe 7: Fe = Fe − F 8: minimal = True 9: F = Fe 10: loop until F = ∅ and not minimal 11: Begin 12: ChooseG ∈ F 13: IfG ⊂ F then 14: minimal=False 15: Else 16: If F ⊂ G then 17: Fe = Fe − {G} 18: F = F −G 19: End 20: End 21: If minimal =True &&F /∈ Fm then 22: Fm = Fm ∪ F 23: return Fm . Ouput 24: end procedure By performing the minimal closed subset algorithm, we get the family of minimal closed subsets. This family, by definition, characterizes the structure underlying the data set E. So, the number of minimal closed subsets is a quite important parameter: it gives us the number of clus- ters to use in the k-means algorithm. Moreover, the ini- tial centroids for starting the k-means process can be deter- mined by using the interior-pseudo-distance for each mini- mal closed subset Fmj ∈ Fm (x0 is chosen as centroid of Fmj if DFmj (x0) = minx∈FmjDFmj (x)). 4.6 MCPTM algorithm In this subsection, we present The Method of Cluster- ing Documents using Pretopology and Topic Modeling (MCPTM) which clusters documents via the Topic Mod- eling and pseudoclosure. At first, an LDA Topic Modeling is learned on the documents to achieve topic-document dis- tributions. The major topic and Hellinger probability dis- tance are used to define relations between documents and these relations are used to define a pretopological space which can be employed to get preliminarily clusters of a corpus and determine the number of clusters. After that, k-means clustering algorithm is used to cluster the docu- ments data with pseudodistance and inter-pseudodistance. The MCPTM algorithm is presented in algorithm 5. 4.7 Implementation in python of the library AMEUR In this part, we briefly present our AMEUR library writ- ten in python. AMEUR is a project connecting the tools that come from the framework of Pretopology, Topic Mod- eling, multi-relations networks analysis and semantic rela- tionship. The library is composed of the following mod- ules: Pretopology, topicmodeling and nlp. Algorithm 5 The MCPTM algorithm: clustering docu- ments using Pretopology and Topic Modeling. Require: D: corpus from set of documents 1: procedure MCPTM(D) 2: θD ← LDA-GIBBS(D, α, β, T ) 3: B ← NEIGHBORHOOD-TM(D, θD ,RMTP ,RdH ) 4: aA← pseudoCLOSURE(B) 5: Fm ←MIMINAL-CLOSED-SUBSETS(D, aA()) 6: k = |Fm|: number of clusters 7: M = {mi},i=1,...,k ,mi = Centroid(Fmi ) 8: while clusters centroids changed do 9: for each x ∈ E −M do 10: compute δ(x,mi), i = 1, . . . , k 11: findm0 with δ(x,m0) = minδ(x,mi)i=1,...,k 12: Fm0 = Fm0 ∪ {x} 13: end for 14: Recompute clusters centroids M. 15: end while 16: return Clusters = {F1, F2, . . . , Fk} . Output 17: end procedure The Pretopology module implements the functions de- scribed in section III. The implementation of the Pre- topology in the AMEUR library allows us to ensures the follow-up of step-by-step processes like dilatation, al- liance, pseudoclosure, closure, family of minimal closed subsets, MCPTM and acceptability in multi-relations net- works. The topicmodeling module implements generative mod- els like the Latent Dirichlet Allocation, LDA Gibbs Sam- pling that allows us to capture the relationships between discrete data. 
This module is used within the AMEUR li- brary for querying purposes e.g to retrieve a set of docu- ments that are relevant to a query document or to cluster a set of documents given a latent topic query. These compu- tations of these queries are ensured by the connection be- tween the topicmodeling module and the Pretopology mod- ule. The nlp (natural language processing) module imple- ments the necessary functions for getting unstructured text data of different sources from web pages or social medias and preparing them as proper inputs for the algorithms im- plemented in other modules of the library. 5 Application and Evaluation The microblogging service Twitter has become one of the major micro-blogging websites, where people can create and exchange content with a large audience. In this sec- tion, we apply the MCPTM algorithm for clustering a set of users around their interests. We have targeted 133 users and gathered their tweets in 133 documents. We have cleaned them and run the LDA Gibbs Sampling algorithm to define the topics distribution of each document and words distri- bution of each topic. We have used then, the MCPTM al- gorithm to automatically detect the different communities for clustering users. We present in the following, the latter steps in more details. 176 Informatica 40 (2016) 169–180 Q.V. Bui et al. Table 1: Words - Topic distribution φ and the related users from the θ distribution Topic 3 Words Prob. Users ID Prob. paris 0.008 GStephanopoulos 42 0.697 charliehebdo 0.006 camanpour 23 0.694 interview 0.006 AriMelber 12 0.504 charlie 0.005 andersoncooper 7 0.457 attack 0.005 brianstelter 20 0.397 warisover 0.004 yokoono 131 0.362 french 0.004 piersmorgan 96 0.348 today 0.004 maddow 72 0.314 news 0.004 BuzzFeedBen 21 0.249 police 0.003 MichaelSteele 81 0.244 Topic 10 Words Prob. Users ID Prob. ces 0.010 bxchen 22 0.505 people 0.007 randizuckerberg 102 0.477 news 0.006 NextTechBlog 88 0.402 media 0.006 lheron 71 0.355 tech 0.006 LanceUlanoff 68 0.339 apple 0.006 MarcusWohlsen 74 0.339 facebook 0.005 marissamayer 76 0.334 yahoo 0.005 harrymccracken 43 0.264 app 0.005 dens 33 0.209 google 0.004 nickbilton 89 0.204 Table 2: Topics - document distribution θ User ID 02 Topic Prob. 10 0.090 16 0.072 12 0.065 18 0.064 0 0.058 User ID 12 Topic Prob. 3 0.504 19 0.039 10 0.036 15 0.035 13 0.032 User ID 22 Topic Prob. 10 0.506 3 0.036 19 0.034 14 0.031 4 0.03 User ID 53 Topic Prob. 17 0.733 1 0.017 18 0.016 13 0.016 11 0.015 User ID 75 Topic Prob. 19 0.526 2 0.029 3 0.029 5 0.028 105 0.028 User ID 83 Topic Prob. 8 0.249 0 0.084 11 0.06 7 0.045 12 0.043 5.1 Data collection Twitter is a micro-blogging social media website that pro- vides a platform for the users to post or exchange text mes- sages of 140 characters. Twitter provides an API that al- lows easy access to anyone to retrieve at most a 1% sample of all the data by providing some parameters. In spite of the 1% restriction, we are able to collect large data sets that contain enough text information for Topic Modeling as shown in [14]. The data set contains tweets from the 133 most famous and most followed public accounts. We have chosen these accounts because they are characterized by the heterogene- ity of the tweets they posts. The followers that they aim to reach comes from different interest areas (i.e. politics, technology, sports, art, etc..). We used the API provided by Twitter to collect the messages of 140 characters between January and February 2015. We gathered all the tweets from a user into a document. 
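As one possible way to implement this collection step (shown for illustration only; the client library and credentials below are placeholders, not the exact tooling used here), a sketch with the tweepy library gathers a user's recent tweets into a single document:

```python
import tweepy

# Hypothetical credentials; real keys come from the Twitter developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def user_document(screen_name, n_tweets=200):
    """Concatenate a user's most recent tweets into one document."""
    tweets = api.user_timeline(screen_name=screen_name, count=n_tweets)
    return " ".join(t.text for t in tweets)

# e.g. documents = {name: user_document(name) for name in accounts}  # 133 accounts
```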
5.2 Data pre-processing Social media data and mainly Twitter data is highly un- structured: typos, bad grammar, the presence of unwanted content, for example, humans expressions (happy, sad, ex- cited, ...), URLs, stop words (the, a, there, ...). To get good insights and to build better algorithms it is essential to play with clean data. The pre-processing step gets the textual data clean and ready as input for the MCPTM algorithm. 5.3 Topic modeling results After collecting and pre-processing data, we obtained data with 133 documents, 158,578 words in the corpus which averages 1,192 words per document and 29,104 different words in the vocabulary. We run LDA Gibbs Sampling from algorithm 1 and received the output with two matri- ces: the document-topic distribution matrix θ and the dis- Table 3: Classifying documents based on their major topic Major Topic prob ≥ 0.3 0.15 < prob < 0.3 Topic 0 112,85,104 - Topic 1 44,129,114 61 Topic 2 101,108,91 90 Topic 3 42,23,12,7,20, 21,81,93,10 131,96,72 Topic 4 125,36,123,0 - Topic 5 82,126 62 Topic 6 127,37,26 92 Topic 7 118,106,32 70,4 Topic 8 113 83,55,59 Topic 9 67,122 111,100 Topic 10 22,102,88,71,74, 43,89,33,65 68,76 Topic 11 54,51,121 29,94 Topic 12 50 12 Topic 13 16,35 38 Topic 14 31,98 - Topic 15 66,73,34, 48 Topic 16 99 - Topic 17 53,30 - Topic 18 47,128,1,124,5 78,115 Topic 19 14,80,39,75,18,103 - None remaining users (probability< 0.15) tribution of terms in topics represented by the matrix φ. We present in Table 1 two topics from the list of 20 topics that we have computed with our LDA implementation. A topic is presented with a distribution of words. For each topic, we have a list of users. Each user is identified with an ID from 0 to 132 and is associated with a topic by an order of probabilities. The two lists of probabilities in topic 3, 10 are extracted respectively from θ and φ distributions. The topic 3 and topic 10 are of particular interest due to the im- portant number of users that are related to them. Topic 3 is about the terrorist attack that happened in Paris and topic 10 is about the international Consumer Electronics Show (CES). Both events happened at the same time that we col- lected our data from Twitter. We note that we have more users for these topics than from other ones. We can con- clude that these topics can be considered as hot topics at this moment. Due to the lack of space, we could not present in details A Multi-Criteria Document Clustering Method Based on. . . Informatica 40 (2016) 169–180 177 all the topics with their distribution of words and all topic distributions of documents. Therefore, we presented six topic distributions θi (sorted by probability) of six users in the table 2. A high probability value of θij indicates that document i is closely related to topic j. Hence, user ID 12 is closely related to topic 3, user ID 22 closely related to topic 10, etc. In addition, documents with low θij values over all the topics are noisy documents that belong to none of the topics. So, there is no major topic in user ID 02 (the max probability < 0.15). We show in Table 3 clusters of documents based on their major topics in two levels with their probabilities. The doc- uments with the highest probability less than 0.15 are con- sidered noisy documents and clustered in the same cluster. 5.4 Results from the k-means algorithm using Hellinger distance After receiving the document-topic distribution matrix θ from LDA Gibbs Sampling, we used the k-means algo- rithm with Hellinger distance to cluster users. 
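Standard k-means implementations assume Euclidean geometry, so a small custom loop is needed to plug in the Hellinger distance. The following sketch is given for illustration only (the function names and the stopping rule are choices made here, not a listing of our implementation):

```python
import numpy as np

def hellinger(p, q):
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

def kmeans_hellinger(theta, k, n_iter=100, seed=0):
    """k-means on the rows of theta (document-topic distributions),
    using the Hellinger distance in the assignment step."""
    rng = np.random.default_rng(seed)
    centroids = theta[rng.choice(len(theta), size=k, replace=False)].copy()
    labels = np.full(len(theta), -1)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the Hellinger distance.
        new_labels = np.array(
            [np.argmin([hellinger(x, c) for c in centroids]) for x in theta]
        )
        if np.array_equal(new_labels, labels):
            break                        # assignments stable: converged
        labels = new_labels
        # Update step: mean of the cluster members, renormalized against drift.
        for j in range(k):
            members = theta[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
                centroids[j] /= centroids[j].sum()
    return labels, centroids

# Usage: labels, _ = kmeans_hellinger(theta, k=13)  # theta: M x K NumPy array from LDA
```

The mean-based centroid used here is the standard choice; MCPTM later replaces it with the interior-pseudo-distance centroid of Section 4.4.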
Table 4 presents the result from the k-means algorithm using the Hellinger distance with a number of clusters k = 13 and random centroids. Based on the mean value of each cluster, we determined the major topic related to each cluster and attached these values in the table. We notice that different choices of initial seed sets can result in very different final partitions.

Table 4: Result from k-means algorithm using Hellinger distance
Cluster | Users | Major Topic
1 | 67, 111, 122 | TP 9 (0.423)
2 | 34, 48, 66, 73 | TP 15 (0.315)
3 | 10, 22, 33, 43, 65, 68, 71, 74, 76, 88, 89, 98, 102 | TP 10 (0.305)
4 | 26, 92 | TP 6 (0.268)
5 | 16, 35, 44, 90, 91, 101, 108, 114, 129 | TP 2 (0.238)
6 | 4, 32, 70, 106, 118 | TP 7 (0.345)
7 | 37, 127 | TP 6 (0.580)
8 | 14, 18, 39, 75, 80, 103 | TP 19 (0.531)
9 | 1, 5, 47, 78, 124, 128 | TP 18 (0.453)
10 | 30, 53 | TP 17 (0.711)
11 | 7, 12, 20, 21, 23, 42, 72, 81, 93, 96, 131 | TP 3 (0.409)
12 | 0, 31, 36, 82, 123, 125 | TP 4 (0.310)
13 | remaining users | None

5.5 Results from the MCPTM algorithm

After getting the results (e.g. Table 2) from our LDA implementation, we defined two relations between two documents, the first based on their major topic, RMTP, and the second based on their Hellinger distance, RdH. We then built the weak pseudoclosure with these relations and applied it to compute the pseudoclosure distance and the minimal closed subsets. With this pseudoclosure distance, we can use the MCPTM algorithm to cluster sets of users with multiple relationships.

Figure 3: Network for 133 users with two relationships based on Hellinger distance (distance ≤ 0.15) and Major topic (probability ≥ 0.15).

Figure 4: Number of elements of minimal closed subsets with different thresholds p0 for RMTP and d0 for RdH.

Figure 4 shows the number of elements of the minimal closed subsets with different thresholds p0 for RMTP and d0 for RdH. We used this information to choose the number of clusters. For this example, we chose p0 = 0.15 and d0 = 0.15, i.e. user i connects with user j if they have the same major topic (with probability ≥ 0.15) or the Hellinger distance dH(θi, θj) ≤ 0.15. From the network (Figure 3) for 133 users built from the weak pseudoclosure, we chose the number of clusters k = 13 since the network has 13 connected components (each component represents an element of the family of minimal closed subsets). We used the inter-pseudoclosure distance to compute the initial centroids and obtained the result: {0, 52, 4, 14, 26, 29, 30, 31, 34, 44, 85, 90, 99}.

Table 5 presents the results of the MCPTM algorithm and the k-means algorithm using the Hellinger distance. We notice that there is almost no difference between the results of the two methods when using the number of clusters k and the initial centroids above. We saw that the largest connected component in the users network (Fig. 3) has many nodes with weak ties. This component represents cluster 13 with 89 elements. It contains the 8 remaining topics that were nonsignificant
or contains noisy documents without major topics.

Table 5: Result from k-means algorithm using Hellinger distance and MCPTM
Cluster | K-means & Hellinger: Users | Topic | MCPTM Algorithm: Users | Topic
1 | 0, 36, 123, 125 | TP 4 (0.457) | 0, 36, 123, 125 | TP 4
2 | 4, 32, 70, 10, 118 | TP 7 (0.345) | 4, 32, 70, 10, 118 | TP 7
3 | 14, 18, 39, 75, 80, 103 | TP 19 (0.531) | 14, 18, 39, 75, 80, 103 | TP 19
4 | 26, 37, 92, 127 | TP 6 (0.424) | 26, 37, 92, 127 | TP 6
5 | 29, 51, 54, 94, 121 | TP 11 (0.345) | 29, 51, 54, 94, 121 | TP 11
6 | 30, 53 | TP 17 (0.711) | 30, 53 | TP 17
7 | 31 | TP 14 (0.726) | 31, 98 | TP 14
8 | 34, 48, 66, 73 | TP 15 (0.315) | 34, 48, 66, 73 | TP 15
9 | 44, 61, 114, 129 | TP 1 (0.413) | 44, 61, 114, 129 | TP 1
10 | 85, 104, 112 | TP 0 (0.436) | 85, 104, 112 | TP 0
11 | 67, 90, 91, 101, 108 | TP 2 (0.407) | 90, 91, 101, 108 | TP 2
12 | 99 | TP 16 (0.647) | 99 | TP 16
13 | remaining users | None | remaining users | None

Hence, we used the k-means algorithm with Hellinger distance for clustering this group with a number of clusters k = 9 and centroids {23, 82, 113, 67, 22, 50, 16, 47, 2}, and show the result in Table 6.

Table 6: Result from k-means algorithm using Hellinger distance for cluster 13 (89 users)
Cluster | Users | Major Topic
13.1 | 7, 12, 20, 21, 23, 42, 72, 81, 93, 96, 131 | TP 3 (0.409)
13.2 | 62, 77, 82, 126 | TP 5 (0.339)
13.3 | 27, 55, 59, 83, 113 | TP 8 (0.218)
13.4 | 67, 111, 122 | TP 9 (0.422)
13.5 | 22, 33, 43, 65, 68, 71, 74, 76, 88, 89, 102 | TP 10 (0.330)
13.6 | 50 | TP 12 (0.499)
13.7 | 16, 35 | TP 13 (0.576)
13.8 | 1, 5, 47, 78, 124, 128 | TP 18 (0.453)
13.9 | remaining users | None

5.6 Evaluation

In this part of the article, we conducted an evaluation of our algorithm by comparing the similarity measure of MCPTM (using the pseudoclosure distance with information from the results of minimal closed subsets) and k-means with a random choice of centroids. The evaluation is performed as follows: we first computed the similarity measure for k-means using three distances: the Euclidean distance, the Hellinger distance and the pseudoclosure distance; we then compared the similarity measures among the three distances with the similarity measure obtained when we use the number of clusters and the initial centroids from the result of the minimal closed subsets. We used the similarity measure proposed by [17] to calculate the similarity between two clusterings of the same dataset produced by two different algorithms, or even by the same k-means algorithm. This measure allows us to compare different sets of clusters without reference to external knowledge and is called an internal quality measure.

5.6.1 Similarity measure

To identify a suitable tool and algorithm for clustering that produces the best clustering solutions, it becomes necessary to have a method for comparing the different results in the produced clusters. To this matter, we used in this article the method proposed by [17].

To measure the "similarity" of two sets of clusters, we define a simple formula here. Let C = {C1, C2, . . . , Cm} and D = {D1, D2, . . . , Dn} be the results of two clustering algorithms on the same data set. Assume C and D are "hard" or exclusive clustering algorithms where the clusters produced are pair-wise disjoint, i.e., each pattern from the dataset belongs to exactly one cluster. Then the similarity matrix for C and D is an m × n matrix S_{C,D}:

S_{C,D} =
\begin{pmatrix}
S_{11} & S_{12} & S_{13} & \ldots & S_{1n} \\
S_{21} & S_{22} & S_{23} & \ldots & S_{2n} \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
S_{m1} & S_{m2} & S_{m3} & \ldots & S_{mn}
\end{pmatrix}   (12)

where S_{ij} = p/q is Jaccard's Similarity Coefficient, with p being the size of the intersection and q being the size of the union of the cluster sets Ci and Dj.
The similarity of clustering C and clustering D is then defined as

Sim(C, D) = \frac{\sum_{1 \le i \le m,\, 1 \le j \le n} S_{ij}}{\max(m, n)}   (13)

5.6.2 Discussion

Figure 5: Illustration of the similarity measure where we have the same initial centroids. The abbreviation E stands for Euclidean distance, H for Hellinger distance and P for the pseudoclosure distance.

Table 7: The results of the clustering similarity for k-means with different distance measures. The abbreviation E stands for Euclidean distance, H for Hellinger distance (see Definition 8) and P for the pseudoclosure distance (see Definitions 10 and 11).
k | Same algorithm: E | H | P | Same centroids: E vs H | E vs P | H vs P | Different centroids: E vs H | E vs P | H vs P | Inter-pseudo centroids: E vs H | E vs P | H vs P
5 | 0.423 | 0.454 | 0.381 | 0.838 | 0.623 | 0.631 | 0.434 | 0.373 | 0.383 | - | - | -
9 | 0.487 | 0.544 | 0.423 | 0.831 | 0.665 | 0.684 | 0.495 | 0.383 | 0.447 | - | - | -
13 | 0.567 | 0.598 | 0.405 | 0.855 | 0.615 | 0.633 | 0.546 | 0.445 | 0.469 | 0.949 | 0.922 | 0.946
17 | 0.645 | 0.658 | 0.419 | 0.861 | 0.630 | 0.641 | 0.641 | 0.493 | 0.518 | - | - | -
21 | 0.676 | 0.707 | 0.445 | 0.880 | 0.581 | 0.604 | 0.687 | 0.478 | 0.491 | - | - | -
25 | 0.736 | 0.720 | 0.452 | 0.856 | 0.583 | 0.613 | 0.715 | 0.519 | 0.540 | - | - | -
29 | 0.723 | 0.714 | 0.442 | 0.864 | 0.578 | 0.600 | 0.684 | 0.4885 | 0.511 | - | - | -
mean | 0.608 | 0.628 | 0.423 | 0.855 | 0.611 | 0.629 | 0.600 | 0.454 | 0.480 | 0.949 | 0.922 | 0.946

We have compared the similarity measure between three k-means algorithms with different initializations of the centroids and different numbers of clusters k. We plotted the similarity measure between the clusters computed with the three k-means algorithms with the same initial centroids in Figure 5 and with the three k-means algorithms with different initial centroids in Figure 6.

Figure 6: Illustration of the similarity measure where we have different initial centroids. The abbreviation E stands for Euclidean distance, H for Hellinger distance and P for the pseudoclosure distance.

We notice that in both figures, the Euclidean distance and the Hellinger distance have a higher similarity measure. This is due to the fact that both distances are similar. In Figure 5, we see a big gap between the clusters of the Euclidean distance and the Hellinger distance and the clusters from the pseudoclosure distance. This gap is closing in Figure 6 and starts opening again from k = 17. With different initial centroids, the pseudoclosure distance closed the gap with the k-means algorithms using the Euclidean and Hellinger distances. But when k > 13, the number of minimal closed subsets, the gap between the pseudoclosure and the other distances starts opening again. In Table 7, where we applied the same algorithm twice, the similarity measure between two clustering results from k-means is low for all three distances: Euclidean, Hellinger and pseudoclosure distance. The different choices of initial centroids can result in very different final partitions.

For k-means, choosing the initial centroids is very important. Our algorithm MCPTM offers a way to compute the centroids based on the analysis of the space of data (in this case text). When we use the centroids computed from the results of minimal closed subsets that we present in Table 5, we obtain the highest similarity: 0.949 for Euclidean vs Hellinger, 0.922 for Euclidean vs pseudoclosure and 0.946 for Hellinger vs pseudoclosure. It means that the results from k-means using the centroids {0, 52, 4, 14, 26, 29, 30, 31, 34, 44, 85, 90, 99} are very similar for all three distances: Euclidean, Hellinger and pseudoclosure.
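For reference, the similarity measure of Equations (12) and (13) can be computed with a few lines of code; the sketch below is an illustration only (clusterings are represented here as lists of sets of document identifiers, an assumed input format).

```python
def clustering_similarity(C, D):
    """Sim(C, D) of Eq. (13): sum of the Jaccard matrix S of Eq. (12),
    normalized by max(m, n). C and D are hard clusterings given as lists of sets."""
    total = sum(
        len(Ci & Dj) / len(Ci | Dj)     # Jaccard coefficient S_ij = p / q
        for Ci in C for Dj in D
    )
    return total / max(len(C), len(D))

# Example: two clusterings of the same six documents.
C = [{0, 1, 2}, {3, 4, 5}]
D = [{0, 1}, {2, 3}, {4, 5}]
print(round(clustering_similarity(C, D), 3))   # 0.611
```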
We can conclude that the result that we obtained from our MCPTM algorithm is a good result for clustering with this Twitter dataset. 6 Conclusion The major finding in this article is that the number of clus- ters and the chosen criterias for grouping the document is closely tied to the accuracy of the clustering results. The method presented here can be considered as a pipeline where we associate Latent Dirichlet Allocation (LDA) and pseudoclosure function. LDA is used to estimate the topic- distribution of each document in corpus and the pseudo- closure function to connect documents with multi-relations built from their major topics or Hellinger distance. With this method both quantitative data and categorical data are used, allowing us to have multi-criteria clustering. We have presented our contribution by applying it on microblogging posts and have obtained good results. In future works, we want to test these results on large scale and more conven- tional benchmark datasets. And we intend also to paral- lelize the developed algorithms. References [1] C. C. Aggarwal and C. Zhai. A survey of text classi- fication algorithms. In Mining text data, pages 163– 222. Springer, 2012. [2] M. Ahat, B. Amor S., M. Bui, S. Jhean-Larose, and G. Denhiere. Document Classification with LSA and Pretopology. Studia Informatica Universalis, 8(1), 2010. [3] Z. Belmandt. Basics of Pretopology. Hermann, 2011. 180 Informatica 40 (2016) 169–180 Q.V. Bui et al. [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirich- let allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003. [5] A. Ferligoj and V. Batagelj. Direct multicriteria clus- tering algorithms. Journal of Classification, 9(1):43– 61, 1992. [6] T. L. Griffiths and M. Steyvers. Finding scientific top- ics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228– 5235, 2004. [7] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004. [8] T. Hofmann. Probabilistic latent semantic index- ing. In Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Develop- ment in Information Retrieval, SIGIR ’99, pages 50– 57, New York, NY, USA, 1999. ACM. [9] D. Koller and N. Friedman. Probabilistic Graph- ical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. [10] T. K. Landauer and S. T. Dutnais. A solution to Plato’s problem: The latent semantic analysis the- ory of acquisition, induction, and representation of knowledge. Psychological review, pages 211–240, 1997. [11] C. Largeron and S. Bonnevay. A pretopological ap- proach for structural analysis. Information Sciences, 144(1–4):169 – 185, 2002. [12] V. Levorato and M. Bui. Modeling the complex dy- namics of distributed communities of the web with pretopology. Proceedings of the 7th International Conference on Innovative Internet Community Sys- tems, 2007. [13] C. D. Manning and P. Raghavan. An Introduction to Information Retrieval, 2009. [14] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Car- ley. Is the sample good enough? comparing data from twitter’s streaming API with twitter’s fire- hose. arXiv:1306.5204 [physics], June 2013. arXiv: 1306.5204. [15] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of latent semantic analysis, 427(7):424–440, 2007. [16] N. K. Thanh Van Le and M. Lamure. A cluster- ing method associating pretopological concepts and k-means algorithm. 
The 12th International Conference on Applied Stochastic Models and Data Analysis, 2007.

[17] G. J. Torres, R. B. Basnet, A. H. Sung, S. Mukkamala, and B. M. Ribeiro. A similarity measure for clustering and its applications. Int J Electr Comput Syst Eng, 3(3):164–170, 2009.

Appendix

List of Notations:
Notation | Meaning
K | number of topics
V | number of words in the vocabulary
M | number of documents
D | corpus
φj, j = 1, . . . , K | distribution of words in topic j
θd, d = 1, . . . , M | distribution of topics in document d
a(A) | pseudoclosure of A
F(A) | closure of A
Fe | family of elementary closed subsets
Fm | family of minimal closed subsets
δ(A, B) | pseudodistance between A and B
DA(x) | interior-pseudodistance of x in A
MTP(d) | major topic of document d
dH(P, Q) | Hellinger distance between P and Q
RMTP | relationship based on major topic
RdH | relationship based on Hellinger distance
k | number of clusters