Informatica 39 (2015) 87-98

An Efficient Algorithm for Mining Frequent Closed Itemsets

Gang Fang^{1,2}, Yue Wu^1, Ming Li^1 and Jia Chen^1
^1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, P. R. China
^2 School of Computer Science and Engineering, Chongqing Three Gorges University, Wanzhou, Chongqing, 404000, P. R. China
E-mail: gangfang07@sohu.com, ywu@uestc.edu.cn, ming.m.li@alcatel-lucent.com, jchen@uestc.edu.cn

Keywords: frequent closed itemsets, Galois connection, granular computing, association rules, data mining

Received: February 6, 2014

To avoid generating an undesirably large set of frequent itemsets when discovering all high-confidence association rules, the problem of finding frequent closed itemsets in a formal mining context has been proposed. In this paper, to address the shortcomings of typical algorithms for mining frequent closed itemsets, such as A-close and CLOSET, we propose an efficient algorithm based on the Galois connection and granular computing. First, we present the smallest frequent closed itemsets and their characteristics, including some properties and theorems, and then propose a novel notion, the smallest frequent closed granule, which spares the algorithm repeated reads of the database and thus reduces the I/O cost of discovering frequent closed itemsets. We then propose a novel model for mining frequent closed itemsets based on the smallest frequent closed granules, together with a connection function for generating the smallest frequent closed itemsets. The generator function creates the power set of the smallest frequent closed itemsets in the enlarged frequent 1-item manner, which efficiently avoids generating an undesirably large set of candidate smallest frequent closed itemsets and so reduces the CPU cost and the main memory occupied when generating the smallest frequent closed granules. Finally, we describe the algorithm for the proposed model. On several different datasets, we report the performance of the algorithm and its performance trends when discovering frequent closed itemsets, and further discuss how to remove the algorithm's bottleneck. All experimental results indicate that the algorithm performs better than the traditional and typical algorithms for mining frequent closed itemsets, and that it also has good scalability; it is suitable for mining dynamic transaction datasets.

Povzetek: A new association-learning algorithm for mining frequent closed itemsets is described.

1 Introduction

Association rules mining was introduced in [1]; Agrawal et al. then proposed a classic algorithm for discovering association rules, namely the Apriori algorithm [2]. However, it is well known that mining frequent patterns often generates a very large number of frequent itemsets and association rules, which reduces not only the efficiency but also the effectiveness of mining, since users have to sift through a large number of mined rules to find useful ones. To avoid this shortcoming, Pasquier et al. introduced the problem of mining frequent closed itemsets [3] and proposed an efficient Apriori-based mining algorithm called A-close. Subsequently, Zaki and Hsiao proposed another mining algorithm, CHARM [4], which improves mining efficiency by exploring an item-based data structure.
However, A-close and CHARM are still costly when mining long patterns or at low minimum support thresholds in large databases; in particular, CHARM depends on its dedicated data structure and needs a very large memory. As a continuation of the study on frequent-pattern mining without candidate generation [5], J. Pei et al. proposed an efficient method for mining frequent closed itemsets without candidate generation, called CLOSET [6]. Further work on mining frequent closed itemsets appears in [7-13]; familiar algorithms include MAFIA [7], CLOSET+ [8] and DCI-CLOSED [9]. At present, there are two main types of method for mining frequent closed itemsets, as follows.

The first mines frequent closed itemsets with candidates, based on the Apriori algorithm [3, 14]. The A-close algorithm [3] is a well-known representative of this method; it adopts the bottom-up search strategy of Apriori-like algorithms [2] and constructs the set of generators in a level-wise manner: (i+1)-generators are created by joining i-generators. The advantages of this method are its low memory usage, simple data structure, and ease of implementation and maintenance; its disadvantages are the CPU time spent matching candidate patterns and the very large I/O cost of repeatedly scanning the database to compute supports.

The second mines frequent closed itemsets without candidates, based on the FP-tree structure [6, 15, 16]. The CLOSET algorithm [6] is an extension of the FP-Growth study of frequent-pattern mining [5]. The advantages of this method are that it avoids the heavy computation over the joined potential generators of the A-close algorithm and saves the I/O cost of reading the database. But it also has disadvantages: the complex data structure costs more memory, recursively creating FP-trees occupies more CPU, and it is troublesome to implement.

Rough set theory [17] and formal concept analysis [18, 19] are two efficient methods for the representation and discovery of knowledge [20, 21]. They are related and often complementary approaches to data analysis, but rough set models enable us to precisely define and analyse many notions of granular computing [22, 23]. Reference [22] develops a general framework for the study of granular computing and knowledge reduction in formal concept analysis. In formal concept analysis, granulation of the universe of discourse, description of granules, relationships between granules, and computing with granules are issues that need further scrutiny. Since the basic structure of a concept lattice induced from a formal context is the set of object concepts, and every formal concept in the concept lattice can be represented as a join of some object concepts, each object concept can be viewed as an information granule in the concept lattice. An important notion in formal concept analysis is thus the formal concept: a pair consisting of a set of objects (the extension) and a set of attributes (the intension) such that the intension consists of exactly those attributes that the objects in the extension have in common, and the extension contains exactly those objects that share all attributes in the intension [22]. In the study of granular computing, a formal concept is defined as a granule, such as an information granule.
Based on the notions of granularity [24] and abstraction [25], the ideas of granular computing have been widely investigated in artificial intelligence [26]. For example, granular computing has been applied to association rules mining [27, 28], where a partition model of granular computing [26] is applied to constructing information granules, depending on rough set theory [29] and quotient space theory [30]. In this paper, we propose a novel model based on granular computing, namely an efficient algorithm for mining frequent closed itemsets, which constructs the set of generators in the enlarged frequent 1-item manner to reduce the CPU cost, and adopts granular computing to reduce the I/O cost.

The rest of the paper is organized as follows. In Section 2, we present the concepts related to closed itemsets and granular computing. In Section 3, we propose a novel model for mining frequent closed itemsets based on granular computing. In Section 4, we describe the efficient mining algorithm. Section 5 reports the performance comparison of our algorithm with A-close and CLOSET. In Section 6, we summarize this work and discuss some future research directions.

2 Related concepts

In this section, referring to the definitions and theorems in [3, 4, 6, 22], we present the following definitions, properties, theorems, and propositions concerning closed itemsets and granular computing.

Definition 2.1 A formal context is a triplet D = (U, A, R), where U = {u_1, u_2, ..., u_n} (||U|| = n), called the universe of discourse, is a finite nonempty set of objects; A = {a_1, a_2, ..., a_m} (||A|| = m), called the attribute set, is also a finite nonempty set of attributes; and R ⊆ U × A, called the relation, is a binary relation between the objects U and the attributes A, where each couple (u, a) ∈ R denotes the fact that the object u (u ∈ U) is related to the attribute a (a ∈ A). To keep the following derivations concise, we let each attribute a (a ∈ A) be Boolean, and each attribute is regarded as an item, i.e. the attribute set A is a general itemset. In fact, the derivations are also suitable for quantitative attributes.

Definition 2.2 Galois connection: let D = (U, A, R) be a formal context. For O ⊆ U and I ⊆ A, we define:
ω(O): P(U) → P(A), namely ω(O) = {i ∈ A | ∀o ∈ O, (o, i) ∈ R}, which denotes the maximal set of items shared by all objects o (o ∈ O);
φ(I): P(A) → P(U), namely φ(I) = {o ∈ U | ∀i ∈ I, (o, i) ∈ R}, which denotes the maximal set of objects that have all items i (i ∈ I).
The couple of applications (ω, φ) is a Galois connection between P(U) and P(A). For I, I_1, I_2 ⊆ A and O, O_1, O_2 ⊆ U, it satisfies:
(1) I_1 ⊆ I_2 ⇒ φ(I_1) ⊇ φ(I_2);
(1') O_1 ⊆ O_2 ⇒ ω(O_1) ⊇ ω(O_2);
(2) I ⊆ ω(O) ⟺ O ⊆ φ(I).

Definition 2.3 Galois closure operator: h = ω ∘ φ: P(A) → P(A), i.e. h(I) = ω(φ(I)). For I, I_1, I_2 ⊆ A it satisfies:
extension: I ⊆ h(I);
monotonicity: I_1 ⊆ I_2 ⇒ h(I_1) ⊆ h(I_2);
idempotency: h(h(I)) = h(I).

Definition 2.4 Closed itemsets: an itemset C ⊆ A from D is a closed itemset if and only if h(C) = C. The smallest (minimal) closed itemset containing an itemset I is obtained by applying h to I; we call h(I) the closure of I.

Theorem 2.1 For a formal context D = (U, A, R), let I_1, I_2 ⊆ A be two itemsets. We have:
h(I_1 ∪ I_2) = h(h(I_1) ∪ h(I_2)).
Proof. Since I_1 ⊆ h(I_1) and I_2 ⊆ h(I_2) (extension), we have I_1 ∪ I_2 ⊆ h(I_1) ∪ h(I_2), and therefore h(I_1 ∪ I_2) ⊆ h(h(I_1) ∪ h(I_2)) (monotonicity). Conversely, since I_1 ⊆ I_1 ∪ I_2 and I_2 ⊆ I_1 ∪ I_2, we have h(I_1) ⊆ h(I_1 ∪ I_2) and h(I_2) ⊆ h(I_1 ∪ I_2) (monotonicity), so h(I_1) ∪ h(I_2) ⊆ h(I_1 ∪ I_2), and hence h(h(I_1) ∪ h(I_2)) ⊆ h(h(I_1 ∪ I_2)) = h(I_1 ∪ I_2) (monotonicity and idempotency). Therefore h(I_1 ∪ I_2) = h(h(I_1) ∪ h(I_2)). □
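To make the operators concrete, the following Python sketch implements ω, φ and the closure h = ω ∘ φ on a small formal context (the example context used later in Section 4.3) and checks Theorem 2.1 on one pair of itemsets. The code and its names are ours and purely illustrative; they are not the paper's implementation (which is in C#).

```python
# Formal context D = (U, A, R), stored as object -> itemset; this is the
# example context used later in Section 4.3.
CONTEXT = {"u1": {"a", "c", "d"}, "u2": {"b", "c"}, "u3": {"a", "b", "e"},
           "u4": {"b", "e"}, "u5": {"a", "c", "e"}}
ALL_ITEMS = set().union(*CONTEXT.values())

def omega(objects):
    """omega(O): the maximal set of items shared by all objects in O."""
    if not objects:                      # omega(empty) = A by convention
        return set(ALL_ITEMS)
    return set.intersection(*(CONTEXT[o] for o in objects))

def phi(itemset):
    """phi(I): the maximal set of objects having every item of I."""
    return {o for o, row in CONTEXT.items() if itemset <= row}

def h(itemset):
    """Galois closure h = omega . phi; I is closed iff h(I) == I."""
    return omega(phi(itemset))

print(h({"a"}))                          # {'a'}          -> {a} is closed
print(h({"a", "b"}))                     # {'a','b','e'}  -> {ab} is not
I1, I2 = {"a"}, {"b"}
assert h(I1 | I2) == h(h(I1) | h(I2))    # Theorem 2.1 on one instance
```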
Proposition 2.1 For a formal context D = (U, A, R), the closed itemset h(I) corresponding to the closure by h of the itemset I (I ⊆ A) is the intersection of the itemsets of all objects in U that contain I:
h(I) = ∩_{o∈U} {ω({o}) | I ⊆ ω({o})}.
Proof. Let H = ∩_{o∈S} ω({o}), where S = {o ∈ U | I ⊆ ω({o})}. We also have h(I) = ω(φ(I)) = ∩_{o∈S°} ω({o}), where S° = {o ∈ U | o ∈ φ(I)}. Let us show that S° = S, i.e. I ⊆ ω({o}) ⟺ o ∈ φ(I): by property (2) of the Galois connection with O = {o}, I ⊆ ω({o}) ⟺ {o} ⊆ φ(I) ⟺ o ∈ φ(I). Hence H = h(I). □

Definition 2.5 A couple G = <I, φ(I)> is defined as a formal granule of the context D = (U, A, R), where I, called the intension of the formal granule, is an abstract description of the common features or properties shared by the objects in the extension, expressed as I = {i_1, i_2, ..., i_k} (I ⊆ A, ||I|| = k); and φ(I), called the extension of the formal granule, is the maximal set of objects that have all items i (i ∈ I), expressed as φ(I) = {o ∈ U | ∀i ∈ I, (o, i) ∈ R}.

Definition 2.6 The intersection operation of two formal granules is denoted by ⊗ and described as follows. Given two formal granules G_α = <I_α, φ(I_α)> and G_β = <I_β, φ(I_β)>, we have:
G = <I, φ(I)> = G_α ⊗ G_β = <I_α ∪ I_β, φ(I_α) ∩ φ(I_β)>.

3 A novel mining model

We first present some definitions, properties, theorems, and corollaries derived from the Galois connection and granular computing, and then propose a novel model for mining frequent closed itemsets based on granular computing.

3.1 Basic concepts

Definition 3.1 Itemset support: for a formal context D = (U, A, R), the support of the itemset I is expressed as support(I) = ||φ(I)|| / ||U||.

Definition 3.2 Frequent itemsets: the itemset I is said to be frequent if the support of I in D is at least the given minsupport. The set FI of frequent itemsets in D is defined as FI = {I ⊆ A | support(I) ≥ minsupport}.

Property 3.1 All subsets of a frequent itemset are frequent; all supersets of an infrequent itemset are infrequent. (Intuitive; see [2].)

Definition 3.3 Frequent closed itemsets: the closed itemset C is said to be frequent if the support of C in D is at least the given minsupport. The set FCI of frequent closed itemsets in D is defined as:
FCI = {C ⊆ A | C = h(C) ∧ support(C) ≥ minsupport}.

Property 3.2 The frequent closed itemsets FCI are a subset of the frequent itemsets FI, namely FCI ⊆ FI.

Definition 3.4 The smallest frequent closed itemsets: the frequent itemset I is said to be a smallest frequent closed itemset if ∀I° ⊂ I, support(I) < support(I°). The set FC_min of the smallest frequent closed itemsets in D is:
FC_min = {I ∈ FI | ∀I° ⊂ I, support(I) < support(I°)}.

Theorem 3.1 For a formal context D = (U, A, R), if I is a frequent closed itemset, then there is a smallest frequent closed itemset I' with φ(I) = φ(I'), i.e. ∀I ∈ FCI ⇒ ∃I' ∈ FC_min with φ(I) = φ(I').
Proof. Let ||I|| = k. There are two cases:
(1) If ∀I_1 ⊂ I (||I_1|| = k−1) we have support(I) < support(I_1), then ∀I° ⊂ I we have support(I) < support(I°), since every proper subset I° of I lies in some such I_1 with support(I_1) ≤ support(I°). Since I ∈ FCI ⊆ FI, we have I ∈ FC_min; let I' = I, and then I' ∈ FC_min ∧ φ(I) = φ(I').
(2) If ∃I_1 ⊂ I (||I_1|| = k−1) with support(I) = support(I_1), then φ(I) = φ(I_1).
(i) If ∀I_2 ⊂ I_1 (||I_2|| = k−2) we have support(I_1) < support(I_2), then, as in case (1), I_1 ∈ FC_min; let I' = I_1, and then I' ∈ FC_min ∧ φ(I) = φ(I_1) = φ(I').
(ii) Otherwise ∃I_2 ⊂ I_1 (||I_2|| = k−2) with support(I_1) = support(I_2), hence φ(I) = φ(I_1) = φ(I_2), and the argument repeats on I_2. Since the size of the itemsets strictly decreases, the descent terminates at some smallest frequent closed itemset I' with φ(I) = φ(I'). □

Theorem 3.2 I ∈ FC_min ⇒ ∀I° ⊂ I, I° ∈ FC_min.
Proof. Suppose, to the contrary, that I ∈ FC_min but ∃I_1 ⊂ I with I_1 ∉ FC_min. Since I_1 ⊂ I and I ∈ FI, I_1 is frequent (Property 3.1); hence there must exist I_2 ⊂ I_1 with support(I_1) = support(I_2), i.e. φ(I_1) = φ(I_2). Let I° = I \ (I_1 \ I_2); then I° ⊂ I and φ(I°) = φ(I_2 ∪ (I \ I_1)) = φ(I_2) ∩ φ(I \ I_1) = φ(I_1) ∩ φ(I \ I_1) = φ(I), so support(I) = support(I°), which contradicts I ∈ FC_min (Definition 3.4). Hence ∀I° ⊂ I, I° ∈ FC_min. □
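The definitions above can be checked by brute force on the same toy context. The sketch below is ours and purely illustrative: it enumerates FI, FCI and FC_min for minsupport = 40% and verifies Theorem 3.1, namely that every frequent closed itemset shares its extension with some smallest frequent closed itemset.

```python
from itertools import combinations

# Example formal context (the one used in Section 4.3): object -> itemset.
CONTEXT = {"u1": {"a", "c", "d"}, "u2": {"b", "c"}, "u3": {"a", "b", "e"},
           "u4": {"b", "e"}, "u5": {"a", "c", "e"}}
ITEMS = sorted(set().union(*CONTEXT.values()))
MINSUP = 0.4

def phi(I):
    """Extension: objects containing every item of I (Definition 2.2)."""
    return frozenset(o for o, row in CONTEXT.items() if I <= row)

def h(I):
    """Closure h(I) = omega(phi(I)) (Definition 2.4)."""
    objs = phi(I)
    return set.intersection(*(CONTEXT[o] for o in objs)) if objs else set(ITEMS)

def support(I):
    """Definition 3.1: ||phi(I)|| / ||U||."""
    return len(phi(I)) / len(CONTEXT)

def proper_subsets(I):
    return [set(J) for r in range(len(I)) for J in combinations(sorted(I), r)]

all_itemsets = [set(J) for r in range(1, len(ITEMS) + 1)
                for J in combinations(ITEMS, r)]
FI = [I for I in all_itemsets if support(I) >= MINSUP]        # Definition 3.2
FCI = [I for I in FI if h(I) == I]                            # Definition 3.3
FC_min = [I for I in FI                                       # Definition 3.4
          if all(support(I) < support(J) for J in proper_subsets(I))]

# Theorem 3.1: each frequent closed itemset has a smallest frequent closed
# itemset with the same extension.
assert all(any(phi(C) == phi(I) for I in FC_min) for C in FCI)
print(sorted("".join(sorted(I)) for I in FC_min))
# ['a', 'ac', 'ae', 'b', 'be', 'c', 'e']
```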
Corollary 3.3 I ∈ FC_min ⇒ ∀I° ⊂ I (||I°|| ≥ 1), I° ∈ FC_min. Equivalently, if some nonempty proper subset of a frequent itemset I does not belong to FC_min, then neither does I; this is the form used for pruning.

Definition 3.5 The smallest frequent closed granules set: the formal granule G = <I, φ(I)> is said to be a smallest frequent closed granule G_min if the intension I of G is a smallest frequent closed itemset. The set FG_min of the smallest frequent closed granules is defined as:
FG_min = {G = <I, φ(I)> | I ∈ FC_min}.

3.2 Frequent closed itemsets mining

In this section, we propose a novel model for mining frequent closed itemsets based on granular computing. Based on the previous introductions, the following is a formal statement of this model. For a formal context D = (U, A, R), discovering all frequent closed itemsets in D can be divided into two steps:

(1) According to the minimal support given by the user, mine the smallest frequent closed granules set in D (details in steps (1) to (18) of Section 4.2).

(2) Based on the smallest frequent closed granules set, discover all frequent closed itemsets in D (details in steps (19) to (21) of Section 4.2).

Here the first step is based on Definition 3.5, Theorem 2.1, and Theorem 3.2; the second step refers to Definition 2.4, Proposition 2.1, and Theorem 3.1. Together they provide the theoretical demonstration for the novel mining model.

4 The efficient mining algorithm

In this section, we describe the novel model as an efficient mining algorithm, denoted EMFCI.

4.1 Generator function

Here we propose a function for generating the intensions of the smallest frequent closed granules.

Definition 4.1 The set vector operation □ for two sets is defined as follows. Let P = {p_1, p_2, ..., p_m} and Q = {q_1, q_2, ..., q_n} be two sets; then the set vector operation is expressed as:

P^T □ Q = ({p_1} {p_2} ... {p_m})^T □ (∅ {q_1} {q_2} ... {q_n})

= ( {p_1}  {p_1,q_1}  {p_1,q_2}  ...  {p_1,q_n}
    {p_2}  {p_2,q_1}  {p_2,q_2}  ...  {p_2,q_n}
    ...
    {p_m}  {p_m,q_1}  {p_m,q_2}  ...  {p_m,q_n} )   (formal notation)

= {{p_1},{p_1,q_1},...,{p_1,q_n},{p_2},{p_2,q_1},...,{p_2,q_n},...,{p_m},{p_m,q_1},...,{p_m,q_n}}   (simple notation)

where {p_i, q_j} abbreviates {p_i} ∪ q_j when the elements of Q are themselves itemsets. This operation is the main idea of the generator function: letting P, Q be two sets, the generator function is expressed as f(P, Q) = P^T □ Q. The application of f(P, Q) is described in Section 4.2.

For example, for a formal context D = (U, A, R), let A be the general itemset {a, b, c}; we use the set vector operation to generate P(A), adding one item I_1 at a time:

(1) P(A) = ∅;
(2) I_1 = {a} ⇒ P(A) = P(A) ∪ (I_1 □ P(A)) = ({a}) □ (∅) = {{a}};
(3) I_1 = {b} ⇒ P(A) = P(A) ∪ (I_1 □ P(A)) = {{a}} ∪ (({b}) □ (∅ {a})) = {{a},{b},{ab}};
(4) I_1 = {c} ⇒ P(A) = P(A) ∪ (I_1 □ P(A)) = {{a},{b},{ab}} ∪ (({c}) □ (∅ {a} {b} {ab})) = {{a},{b},{ab},{c},{ac},{bc},{abc}}.

For a formal context D = (U, A, R), if A is a general itemset, namely a set of Boolean attributes, then P(A) is the ordinary power set (without ∅), where ||P(A)|| = 2^||A|| − 1. But if A is a set of quantitative attributes, then P(A) is called the extended power set of A, and ||P(A)|| is expressed as:

||P(A)|| = ∏_{a∈A} (||V_a|| + 1) − 1,

where V_a is the preprocessed discrete range of the attribute a ∈ A. A small sketch of this operation follows.
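This is a minimal Python sketch of the set vector operation and of the incremental construction of P(A), assuming the Boolean case; the function names are ours.

```python
def vector_op(P, Q):
    """P^T □ Q (Definition 4.1): for each item p of P, emit {p} itself
    (the pairing with the empty set) and {p} ∪ q for every q in Q."""
    out = []
    for p in P:
        out.append({p})
        out.extend({p} | q for q in Q)
    return out

def power_set(items):
    """f(P, Q) applied one item at a time: P(A) := P(A) ∪ (I_1 □ P(A)),
    reproducing the example above for A = {a, b, c}."""
    PA = []                              # step (1): P(A) = empty
    for item in items:                   # steps (2)-(4)
        PA = PA + vector_op([item], PA)
    return PA

print(power_set(["a", "b", "c"]))
# [{'a'}, {'b'}, {'a','b'}, {'c'}, {'a','c'}, {'b','c'}, {'a','b','c'}]
# (element order inside the printed sets may vary)
```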
4.2 An algorithm for mining frequent closed itemsets

In this section, we describe the efficient algorithm based on the novel model of Section 3 via the following pseudo code.

Algorithm: EMFCI
Input: a formal context D = (U, A, R); the minimal support minsupport.
Output: the frequent closed itemsets FCI.
(1)  Read D;
(2)  Construct FG = {FG_a | a ∈ A ∧ ∀G = <I, φ(I)> ∈ FG_a, I ⊆ V_a ∧ ||I|| = 1 ∧ support(I) ≥ minsupport}; // the frequent 1-item granules, grouped by attribute
(3)  F = {F_a ⊆ V_a | ∀v ∈ F_a, G = <{v}, φ({v})> ∈ FG_a ∧ FG_a ∈ FG, a ∈ A}; // V_a is the range of the attribute a ∈ A
(4)  FC_min = ∅;
(5)  For (∀a ∈ F) do begin
(6)    S_c = a □ FC_min; // generate the candidate smallest frequent closed itemsets
(7)    For (∀s ∈ S_c) do begin
(8)      If ((∀t_1 ∈ NF, t_1 ⊄ s) ∧ (∀t_2 ∈ NFC_min, t_2 ⊄ s)) then // prune by Property 3.1 and Corollary 3.3
(9)        Construct G = <s, φ(s)>;
(10)       If (||φ(s)|| ≥ minsupport) then
(11)         If (∀i ⊂ s, ||φ(s)|| < ||φ(i)||) then // Definition 3.4
(12)           Write G = <s, φ(s)> to FG_min;
(13)           Write s to FC_min;
(14)         else
(15)           Write s to NFC_min; // frequent but not smallest
(16)       else
(17)         Write s to NF; // infrequent
(18)  End
(19)  For (∀G = <I, φ(I)> ∈ FG_min) do begin
(20)    Write h(I) = ω(φ(I)) to FCI;
(21)  End

4.3 An example

For a formal context D = (U, A, R), where A = {a, b, c, d, e}, U = {u_1, u_2, u_3, u_4, u_5}, u_1 = {acd}, u_2 = {bc}, u_3 = {abe}, u_4 = {be}, u_5 = {ace}, and minsupport = 40%, the course of discovering frequent closed itemsets is described in Table 1.

Step 1. FG = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{e},{3,4,5}>} (pruning {d} by Property 3.1 and Definition 3.3).
Step 2. F = {{a},{b},{c},{e}}.
Step 3. a = {a} ⇒ S_c = {{a}}; FG_min = {<{a},{1,3,5}>}; FC_min = {{a}}.
Step 4. a = {b} ⇒ S_c = {{b},{ab}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>}; FC_min = {{a},{b}} (pruning {ab} by Property 3.1 and Definition 3.3).
Step 5. a = {c} ⇒ S_c = {{c},{ac},{bc}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>}; FC_min = {{a},{b},{c},{ac}} (pruning {bc} by Property 3.1 and Definition 3.3).
Step 6. a = {e} ⇒ S_c = {{e},{ae},{be},{ce},{ace}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>, <{e},{3,4,5}>, <{ae},{3,5}>, <{be},{3,4}>}; FC_min = {{a},{b},{c},{ac},{e},{ae},{be}} (pruning {ce, ace} by Property 3.1 and Definition 3.3). The search ends here: all the smallest frequent closed granules FG_min have been discovered.
Step 7. h({a}) = u_1 ∩ u_3 ∩ u_5 = {a}; h({b}) = u_2 ∩ u_3 ∩ u_4 = {b}; h({c}) = u_1 ∩ u_2 ∩ u_5 = {c}; h({ac}) = u_1 ∩ u_5 = {ac}; h({e}) = u_3 ∩ u_4 ∩ u_5 = {e}; h({ae}) = u_3 ∩ u_5 = {ae}; h({be}) = u_3 ∩ u_4 = {be}. Based on the smallest frequent closed granules set FG_min, all frequent closed itemsets are obtained.
Step 8. Answer: FCI = {{a},{b},{c},{ac},{e},{ae},{be}}.

Table 1: Frequent closed itemsets mining for minsupport = 40%.

For mining frequent closed itemsets, the algorithm adopts the following pruning strategies: Property 3.1, Definitions 3.3 and 3.4, and Corollary 3.3. They help the algorithm efficiently reduce the search space. A runnable sketch of the whole procedure on this example is given below.
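The following sketch is our Python reading of the pseudo code above, run on the example context of Table 1; it is illustrative, not the authors' implementation. The lists named infrequent and not_smallest play the roles of NF and NFC_min in steps (8)-(17), and the result reproduces the answer of Table 1.

```python
CONTEXT = {"u1": {"a","c","d"}, "u2": {"b","c"}, "u3": {"a","b","e"},
           "u4": {"b","e"}, "u5": {"a","c","e"}}
MINSUP = 2                                      # 40% of 5 objects, as a count

def phi(I):                                     # extension of a granule
    return frozenset(o for o, row in CONTEXT.items() if I <= row)

def omega(objs):                                # intension of a granule
    return frozenset.intersection(*(frozenset(CONTEXT[o]) for o in objs))

# Steps (1)-(3): the frequent 1-item granules.
items = sorted(set().union(*CONTEXT.values()))
F = [i for i in items if len(phi({i})) >= MINSUP]        # drops {d}

# Steps (4)-(18): build FG_min in the enlarged frequent 1-item manner.
FG_min = {}                                     # intension -> extension
infrequent, not_smallest = [], []               # NF and NFC_min
for a in F:
    candidates = [frozenset({a})] + [frozenset({a}) | t for t in list(FG_min)]
    for s in candidates:
        # step (8): skip supersets of known infrequent / non-smallest sets
        if any(t <= s for t in infrequent + not_smallest):
            continue
        ext = phi(s)                            # step (9)
        if len(ext) < MINSUP:                   # step (10) fails
            infrequent.append(s)                # Property 3.1
        elif any(len(ext) == len(FG_min[t]) for t in FG_min if t < s):
            not_smallest.append(s)              # fails Definition 3.4
        else:
            FG_min[s] = ext                     # steps (12)-(13)

# Steps (19)-(21): close every smallest granule to obtain FCI.
FCI = {omega(ext) for ext in FG_min.values()}
print(sorted("".join(sorted(c)) for c in FCI))
# ['a', 'ac', 'ae', 'b', 'be', 'c', 'e']
```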
5 Performance and scalability study

In this section, we design the following experiments on several different datasets. First, we report the performance of the algorithm EMFCI against A-close and CLOSET on six different datasets. Second, we report the relationships between some parameters of the datasets and the performance of EMFCI for mining frequent closed itemsets. Finally, to address the bottleneck of EMFCI, we improve it to obtain the algorithm IEMFCI, and report its performance on an extended high-dimension dataset to show the scalability of EMFCI.

There are two original datasets. The first is the Food Mart 2000 retail dataset, which comes with SQL Server 2000 and contains 164558 records from 1998. Taking the items purchased by the same customer at the same time as one basket, and generalizing the bottom items to the product department (because the supports of the bottom items are small), we obtain 34015 transactions with time-stamps. It is a dataset with Boolean attributes.

The second comes from Web log data, real data recording the browsing behaviour of students, where the attribute set consists of login time, duration, network flow, ID type, and sex. This dataset, with discrete quantitative attributes, has 296031 transactions.

We generalize attributes, and replicate some attributes or transactions, to create the extended datasets described in Table 2, where each dataset can be defined as a formal mining context D = (U, A, R).

All the experiments are performed on an Intel(R) Core(TM)2 Duo CPU (T6570 @ 2.10 GHz) PC with 1.99 GB main memory, running Microsoft Windows XP Professional. All the programs are written in C# with Microsoft Visual Studio 2008. The algorithms A-close and CLOSET are implemented as described in [3] and [6].

Table 2: The datasets used in the experiments (description; ||P(A)||; ||U||).
  Dataset 1: the first original dataset; 2^22 − 1; 34015.
  Dataset 2: dataset 1 with three attributes replicated; 2^25 − 1; 34015.
  Dataset 3: dataset 1 replicated four times; 2^22 − 1; 5 × 34015.
  Dataset 4: the second original dataset; 5·4·4·14·3 − 1; 296031.
  Dataset 5: dataset 4 with one attribute replicated; 5·4·4·14·3·5 − 1; 296031.
  Dataset 6: dataset 4 replicated one time; 5·4·4·14·3 − 1; 2 × 296031.
  Dataset 7: for Food Mart 2000, taking the same customer at the same time as a basket and generalizing the bottom items to the product subcategory; 2^102 − 1; 34015.

5.1 The experiments of performance comparison

In this section, for discovering frequent closed itemsets on these different datasets, we compare the algorithm EMFCI with A-close and CLOSET from two aspects: one compares their performance as the minimal support increases; the other compares them as the number of frequent closed itemsets increases.

1. Testing on the original datasets

For the two original datasets, we first compare EMFCI with A-close and CLOSET while varying the minimal support and the number of frequent closed itemsets. The experimental results are described in Figures 1, 2, 3, and 4, respectively.

Figure 1: Performance comparison with the support on dataset 1.
Figure 2: Performance comparison with the number of frequent closed itemsets on dataset 1.
Figure 3: Performance comparison with the support on dataset 4.
Figure 4: Performance comparison with the number of frequent closed itemsets on dataset 4.

Based on the comparison results in Figures 1-4, we see that the performance of EMFCI is better than that of A-close and CLOSET. Obviously, CLOSET is also superior to A-close; hence we do not compare EMFCI with A-close in the following experiments.

2. Testing on the extended datasets

We further report the performance of the algorithm EMFCI on the extended datasets.
Based on the different minimal supports and numbers of frequent closed itemsets, we compare EMFCI with CLOSET; the experimental results are described in Figures 5 to 12.

Figure 5: Performance comparison with the support on dataset 2.
Figure 6: Performance comparison with the number of frequent closed itemsets on dataset 2.
Figure 7: Performance comparison with the support on dataset 3.
Figure 8: Performance comparison with the number of frequent closed itemsets on dataset 3.
Figure 9: Performance comparison with the support on dataset 5.
Figure 10: Performance comparison with the number of frequent closed itemsets on dataset 5.
Figure 11: Performance comparison with the support on dataset 6.
Figure 12: Performance comparison with the number of frequent closed itemsets on dataset 6.

Based on the comparison results in Figures 5 to 12, we see that the performance of EMFCI is also better than that of CLOSET on datasets with Boolean or quantitative attributes.

5.2 The relationships between these parameters and performances

In this part, we mainly discuss the relationships between the performance and the following parameters:

||U||: the number of objects in the formal mining context D = (U, A, R); in other words, the number of transactions in the mining database.

||P(I)||: the number of nonempty sets in the power set of the attribute values, called the search space of the algorithm, where I holds the smallest frequent closed 1-item values drawn from the attribute set A and P(I) is defined as the (extended) power set of I (refer to Section 4.1).

The performance itself is represented by two kinds of parameters:

t(x): the runtime of algorithm x, from input to output, for mining frequent closed itemsets.

ρ: the improved ratio of the runtime between the algorithms EMFCI and CLOSET, given by ρ = 1 − t(EMFCI) / t(CLOSET). For example, if EMFCI takes 2 s where CLOSET takes 5 s, then ρ = 0.6.

1. The relationships between the performances and the search space

(1) Reporting the relationships on the extended dataset of the first original dataset

For the first original dataset (dataset 1), we test the trend of the performance as the search space increases on dataset 2, the extended dataset obtained by replicating three attributes of dataset 1. As the search space varies, the trend of the runtime of EMFCI is shown in Figure 13, and the trend of the improved ratio between EMFCI and CLOSET is shown in Figure 14.

Figure 13: The trend of the runtime on dataset 2.
Figure 14: The trend of the improved ratio on dataset 2.

Based on Figure 13, the runtime grows as the search space increases. Based on Figure 14, the improved ratio falls as the search space increases.

(2) Reporting the relationships on the extended dataset of the second original dataset

For the second original dataset (dataset 4), we extend an attribute to get dataset 5, and test the trend of the performance on this dataset. The experimental results are shown in Figures 15 and 16, respectively.

Figure 15: The trend of the runtime on dataset 5.
Figure 16: The trend of the improved ratio on dataset 5.

According to Figures 15 and 16, we obtain similar comparison results as above. Hence we can draw the following conclusions: the runtime of EMFCI grows as the search space increases; on the contrary, the improved ratio falls. That is, if the search space keeps increasing, the performance of EMFCI becomes worse and worse; in other words, the algorithm is not suitable for mining datasets with too many smallest frequent closed itemsets.
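As a quick arithmetic check of the search-space parameter, the formula of Section 4.1 reproduces the ||P(A)|| column of Table 2. The factor lists below are our interpretation of that column (each listed factor already being ||V_a|| + 1):

```python
from math import prod   # Python 3.8+

# Factors per attribute, read off Table 2's ||P(A)|| column; each factor
# is ||V_a|| + 1. These lists are our reading, not given in the paper.
dataset4 = [5, 4, 4, 14, 3]          # the second original dataset
dataset5 = dataset4 + [5]            # dataset 4 with one replicated attribute

def search_space(factors):
    """||P(A)|| = prod over attributes of (||V_a|| + 1), minus 1."""
    return prod(factors) - 1

print(search_space(dataset4))        # 3359
print(search_space(dataset5))        # 16799
print(2 ** 22 - 1)                   # 4194303, the Boolean dataset 1
```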
2. The relationships among the performances, the search space and the number of objects

(1) Reporting the relationships on the first original dataset and its extended dataset

For the first original dataset (dataset 1) and its extended dataset (dataset 3, obtained by replicating its objects four times), we test the trend of the performance as the search space increases on the two datasets. As the search space varies, the trend of the runtime of EMFCI is shown in Figure 17, and the trend of the improved ratio between EMFCI and CLOSET is shown in Figure 18.

Figure 17: The trend of the runtime on datasets 1 and 3.
Figure 18: The trend of the improved ratio on datasets 1 and 3.

Based on Figure 17, the runtime of the algorithm grows as the search space or the number of objects increases. Based on Figure 18, the improved ratio of the algorithm falls as the search space increases, but it becomes relatively stable as the number of objects increases.

(2) Reporting the relationships on the second original dataset and its extended dataset

For the second original dataset (dataset 4), we replicate its objects one time to get dataset 6, and test the trend of the performance on datasets 4 and 6. The experimental results are shown in Figures 19 and 20, respectively.

Figure 19: The trend of the runtime on datasets 4 and 6.
Figure 20: The trend of the improved ratio on datasets 4 and 6.

According to Figures 19 and 20, we draw the same conclusions: the runtime of EMFCI grows as the search space or the number of objects increases; the improved ratio falls as the search space increases, but becomes relatively stable as the number of objects increases. That is, the performance of EMFCI remains relatively stable as the number of objects grows; hence it is suitable for mining dynamic transaction datasets.

According to all these experimental results, we can draw the following conclusions:
(1) The performance of EMFCI is better than that of the traditional typical algorithms for mining frequent closed itemsets on datasets with Boolean or quantitative attributes.
(2) The runtime of EMFCI grows with the search space. If the search space is too large, its performance becomes worse and worse; this is the bottleneck of the algorithm.
(3) The runtime of EMFCI also grows as the number of objects increases.
(4) Compared with CLOSET, the improved ratio falls as the search space grows, but becomes relatively stable as the number of objects increases. That is, the performance of EMFCI remains relatively stable as the number of objects grows, so it is suitable for mining dynamic transaction datasets.

5.3 A further discussion for solving the bottleneck of the algorithm

Based on the conclusions in Section 5.2, for a formal mining context D = (U, A, R), if the search space ||P(I)|| is too large, where I (I ⊆ A) holds the smallest frequent closed items and P(I) is defined as the power set of I, the performance of EMFCI becomes worse and worse. In this section, we adopt a partitioning method to avoid the bottleneck; in other words, the oversized search space is divided into several smaller search spaces. The theoretical basis can be described as follows.

Let I = {a^{i_1}, a^{i_2}, ..., a^{i_m}} (I ⊆ A); then we have ||P(I)|| = ∏_{j=1}^{m} (||V_{i_j}|| + 1) − 1, namely:

||P(I)|| + 1 = ∏_{j=1}^{m} (||V_{i_j}|| + 1)
            = ((||V_{i_1}||+1) · ... · (||V_{i_{m_1}}||+1)) · ((||V_{i_{m_1+1}}||+1) · ... · (||V_{i_{m_1+m_2}}||+1)) · ... · ((||V_{i_{m_1+...+m_{k-1}+1}}||+1) · ... · (||V_{i_m}||+1))

with m_1 + m_2 + ... + m_k = m. Obviously, we then have:

||P(I)|| + 1 = (||P(I_{m_1})|| + 1) · (||P(I_{m_2})|| + 1) · ... · (||P(I_{m_k})|| + 1),

where I_{m_1} = {a^{i_1}, ..., a^{i_{m_1}}}, I_{m_2} = {a^{i_{m_1+1}}, ..., a^{i_{m_1+m_2}}}, ..., I_{m_k} = {a^{i_{m_1+...+m_{k-1}+1}}, ..., a^{i_m}}.

In this paper, we require ||P(I_{m_j})|| ≤ λ for each part, where λ is the partitioning threshold given to the improved algorithm IEMFCI. A sketch of this partitioning step follows.
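Below is a minimal sketch of such a partitioning step, assuming Boolean items (so ||P(F_j)|| = 2^||F_j|| − 1). The greedy grouping and the names partition and LAMBDA are ours, chosen so that the result reproduces the split F_1 = {{a},{b}}, F_2 = {{c},{e}} of the example that follows; the paper does not prescribe a particular partitioning procedure.

```python
def partition(F, lam):
    """Greedily split F into consecutive groups F_1, ..., F_k such that
    each group's own search space 2**len(group) - 1 stays <= lam."""
    groups, current = [], []
    for item in F:
        if current and 2 ** (len(current) + 1) - 1 > lam:
            groups.append(current)      # adding item would exceed lambda
            current = []
        current.append(item)
    if current:
        groups.append(current)
    return groups

F = [{"a"}, {"b"}, {"c"}, {"e"}]        # frequent 1-item sets of Section 4.3
LAMBDA = 4
print(partition(F, LAMBDA))             # [[{'a'}, {'b'}], [{'c'}, {'e'}]]
```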
5.3 A further discussion for solving the bottleneck of the algorithm Based on these conclusions in section 5.2, for the formal mining context D = (11. A. R). if the search space □ P(I) I is overlarge, where /(/c A) is the smallest frequent closed itemsets, P(I) is defined as the power set of I, the performance of EMFCI will become worse and worse. (n\ + m2+...+ mk = m). Obviously, we also have || P{I) || +1 = (iim^ii+D-di^^jii+D-.-.-dimju+i): Where I = {a'\a'=.....a'-}. T — S a ®>+1 o ®>+2 a 1m2 ~\a >a >-~>a T _ f Im, ~ia -m2 ^ tm[+mi+..±mQi-Y)+mk ^ In this paper, we let || P(Im ) ||, < {£},{2,3,4} >, < {c},{1,2,5} >, < {e},{3,4,5} >} . Step2. F = {{a},{Z?},{c},{e}},|| P(F) ||= 15 > A = 4 . Step3. Partitioning the search space, get two search spaces/^ = {{a}, {£}}, F2 ={{c},{e}}, where || P(Ft) ||< 4 . Step4. For the first search space Fl = {{a},{b}} , have ®a ={a}=> Sc ={{a}} FGlmln = {< {a},{l,3.5} >}. FClmln = {{a}} : ©a ={&}=> 5c={{Z>},{aZ>}} FGlmln = {< {a},{l, 3.5} >. < {£}.{2.3.4} >} . FClmln={{a},{b}}. For the second search space F2 ={{c},{e}}, have ®a ={c}=> Sc ={{c}} FGL ={<{c}.{1.2.5}>}. FC\n ={{c}} ; 96 Informatica 39 (2015 ) 87-98 G. Fang et al. ©a ={e}=> Sc = {{e},{ce}} FGlln = {<{c}.{1.2.5}>.<{e}.{3.4.5}>} . FCi,={{c}.{e}}. Step5. F = {FClmln,FC2mJ , repeating the step2, where || P{F) ||= 15 > 4 , but || F||= 2 , the partitioning operation must be ended; otherwise, the algorithm need to continue to partition the search space. a = FC)nin^Sc={{a},{b}}, FGmin = {< {a},{l, 3.5} >. < {b},{2.3.4} >} . FCmln={{a},{b}}: a = FCln^Sc=^(0 {a} {6}) = {{c},{ac},{bc},{e},{ae},{be}}; FGmln = {< {a},{l, 3.5} >. < {b},{ 2.3.4} >} <{c},{1, 2,5} >, <{ac},{l, 5} >,<{e},{3,4,5} >, < {ae},{3,5} >, < {fe},{3,4} >} FCmin={{a},{b},{c},{ac},{e},{ae},{be}} The rest of steps are the same as the example in section 4.3. The algorithm IEMFCI reduces the checking of itemset{ ace}, but adds the task of partitioning. As the number of transactions is lesser, the example does not show its advantage, please see the experiments in section 5.3.3. Here, the example only describes the execution course of IEMFCI. 5.3.2 Comparisons of the time and space complexity For l)= (11. /1. A') . let C be a set of frequent closed itemsets, and let L be the average length of frequent closed itemsets, k > 2 is a parameter with partitioning the search space. The comparisons are expressed as table 3. Items Time complexity Space complexity A-close ancih 0(11 C||/|| A\\) CLOSET 0(11 C||2) 0(11 C||) IEMFCI 0((L/k + \)-\\C||) 0(\\C\\/k\\ A\\) Figure 21: Performance comparison with the lower support on dataset 7. Figure 22: Performance comparison with the higher support on dataset 7. Then, for the improved algorithm IEMFCI, we adopt different parameters X to test its trend of performance, where X = 25, X = 219 and X = I22. The comparison result is expressed as figure 23, where IEMFCI (X = p(2. n)) is the improved algorithm IEMFCI when the parameter of partitioning the search space is X = p(2, n) = 2". Table 3: Comparisons of the time and space complexity. 5.3.3 Test on the high dimension datasets In this section, to show the scalability of the algorithm EMFCI, firstly, we compare the improved algorithm IEMFCI with EMFCI, A-close and CLOSET on the high dimension dataset (dataset 7 as table 1), which is an extended dataset based on the first original dataset. The comparison results are expressed as figure 21 and 22, where the parameter p{2, m) = 2"' on the abscissa shows the search space □ P{I) [of the given support. 
Figure 23: The trend of performance with the different parameters on dataset 7.

Based on these comparisons, we draw the following conclusions. First, the improved algorithm IEMFCI is better than the algorithms EMFCI, A-close and CLOSET. Second, IEMFCI removes the bottleneck of EMFCI; especially when the search space ||P(I)|| is very large, the advantage of IEMFCI is very distinct. Finally, the partitioning parameter of IEMFCI should be neither too big nor too small.

6 Conclusion

In this paper, to address the shortcomings of typical algorithms for mining frequent closed itemsets, we propose an efficient algorithm based on the Galois connection and granular computing. We present the notion of the smallest frequent closed granule to reduce the I/O cost of discovering frequent closed itemsets, and we propose a connection function for generating the smallest frequent closed itemsets in the enlarged frequent 1-item manner to reduce the CPU cost and the occupied main memory. When the number of smallest frequent closed itemsets is too large, the performance of the algorithm degrades, so we further discuss how to remove this bottleneck, namely by proposing an improved algorithm for high-dimension datasets. The algorithm is also suitable for mining dynamic transaction datasets.

Acknowledgement

The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by the Chongqing Cutting-edge and Applied Foundation Research Program (Grant No. cstc2014jcyjA40035), and by the Scientific and Technological Research Program of Chongqing Three Gorges University (Grant No. 13ZD20).

References

[1] R. Agrawal, T. Imielinski, and A. Swami (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Int'l Conference on Management of Data, Washington DC, USA, pp. 207-216.
[2] R. Agrawal and R. Srikant (1994). Fast algorithms for mining association rules. In Proceedings of the 20th Int'l Conference on Very Large Data Bases, Santiago, Chile, pp. 487-499.
[3] N. Pasquier, Y. Bastide, R. Taouil et al. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7th Int'l Conference on Database Theory, Jerusalem, Israel, January, pp. 398-416.
[4] M. J. Zaki and C.-J. Hsiao (1999). CHARM: An efficient algorithm for closed association rule mining. Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute.
[5] J. Han, J. Pei, and Y. Yin (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD Int'l Conference on Management of Data, New York, USA, pp. 1-12.
[6] J. Pei, J. Han, and R. Mao (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, Texas, USA, pp. 21-30.
[7] D. Burdick, M. Calimlim, and J. Gehrke (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th Int'l Conference on Data Engineering, Heidelberg, pp. 443-452.
[8] J. Y. Wang, J. Han, and J. Pei (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets.
In Proceedings of the 9th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 236-245.
[9] C. Lucchese, S. Orlando, and R. Perego (2006). Fast and memory efficient mining of frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 21-36.
[10] R. Singh, T. Johnsten, V. Raghavan et al. (2010). Efficient algorithm for discovering potential interesting patterns with closed itemsets. In Proceedings of the 2010 IEEE Int'l Conference on Granular Computing, pp. 414-419.
[11] F. Nori, M. Deypir, M. Hadi et al. (2011). A new sliding window based algorithm for frequent closed itemset mining over data streams. In Proceedings of the 1st Int'l Conference on Computer and Knowledge Engineering, IEEE Press, pp. 249-253.
[12] G.-P. Chen, Y.-B. Yang, and Y. Zhang (2012). MapReduce-based balanced mining for closed frequent itemsets. In Proceedings of the 2012 IEEE 19th Int'l Conference on Web Services, IEEE Press, pp. 652-653.
[13] M. Sreedevi and L. S. S. Reddy (2013). Mining regular closed patterns in transactional databases. In Proceedings of the 2013 7th Int'l Conference on Intelligent Systems and Control, IEEE Press, pp. 380-383.
[14] Yu-quan Z. and Yu-qing S. (2007). Research on an algorithm for mining frequent closed itemsets. Journal of Computer Research and Development, vol. 44, no. 7, pp. 1177-1183.
[15] Shengwei L., Lingsheng L., and Chong H. (2009). Mining closed frequent itemset based on FP-Tree. In Proceedings of the IEEE Int'l Conference on Granular Computing, IEEE Press, pp. 354-357.
[16] J. Wachiramethin and J. Werapun (2009). BPA: A Bitmap-Prefix-tree Array data structure for frequent closed pattern mining. In Proceedings of the 2009 Int'l Conference on Machine Learning and Cybernetics, IEEE Press, vol. 1, pp. 154-160.
[17] Z. Pawlak (1982). Rough sets. International Journal of Computer and Information Sciences, vol. 11, pp. 341-356.
[18] R. Wille (1982). Restructuring lattice theory: an approach based on hierarchies of concepts. In: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht-Boston, pp. 445-470.
[19] B. Ganter and R. Wille (1999). Formal Concept Analysis: Mathematical Foundations. Springer, Berlin.
[20] J. Poelmans, D. I. Ignatov, S. O. Kuznetsov et al. (2013). Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications, vol. 40, no. 16, pp. 6538-6560.
[21] M. W. Shao and Y. Leung (2014). Relations between granular reduct and dominance reduct in formal contexts. Knowledge-Based Systems, vol. 65, pp. 1-11.
[22] W.-Z. Wu, Y. Leung, and J.-S. Mi (2009). Granular computing and knowledge reduction in formal contexts. IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1461-1474.
[23] R. Belohlavek, B. De Baets, and J. Konecny (2014). Granularity of attributes in formal concept analysis. Information Sciences, vol. 260, pp. 149-170.
[24] J. R. Hobbs (1985). Granularity. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, San Francisco, USA, pp. 432-435.
[25] F. Giunchiglia and T. Walsh (1992). A theory of abstraction. Artificial Intelligence, vol. 57, no. 2-3, pp. 323-389.
[26] Y. Y. Yao (2004). A partition model of granular computing. Lecture Notes in Computer Science, Transactions on Rough Sets, vol. 3100, pp. 232-253.
[27] T. R. Qiu, X. Q. Chen, Q. Liu et al. (2010).
Granular computing approach to finding association rules in relational database. International Journal of Intelligent Systems, no. 25, pp. 165-179.
[28] G. Fang and Y. Wu (2013). Frequent spatiotemporal association patterns mining based on granular computing. Informatica, vol. 37, no. 4, pp. 443-453.
[29] Z. Pawlak (1998). Granularity of knowledge, indiscernibility and rough sets. In Proceedings of the IEEE Int'l Conference on Fuzzy Systems, IEEE Press, Anchorage, AK, pp. 106-110.
[30] L. Zhang and B. Zhang (2003). The quotient space theory of problem solving. Lecture Notes in Computer Science, vol. 2639, pp. 11-15.