Experimental evaluation of three partition selection criteria for decision table decomposition

Blaž Zupan and Marko Bohanec
Jožef Štefan Institute, Jamova 39, Ljubljana, Slovenia
Phone: +386 61 177 3900, Fax: +386 61 125 1038
E-mail: blaz.zupan@ijs.si, marko.bohanec@ijs.si

Keywords: decision table decomposition, partition selection criteria, intermediate concepts, concept hierarchy, knowledge discovery

Edited by: Rudi Murn
Received: May 20, 1997   Revised: March 7, 1998   Accepted: April 18, 1998

Decision table decomposition is a machine learning approach that decomposes a given decision table into an equivalent hierarchy of decision tables. The approach aims to discover decision tables that are overall less complex than the initial one, potentially easier to interpret, and introduce new and meaningful intermediate concepts. Since an exhaustive search for an optimal hierarchy of decision tables is prohibitively complex, the decomposition uses a suboptimal iterative algorithm that requires the so-called partition selection criterion to decide among possible candidates for decomposition. This article introduces two such criteria and experimentally compares their performance with a criterion originally used for the decomposition of Boolean functions. The experiments highlight the differences between the criteria, but also show that in all three cases the decomposition may discover meaningful intermediate concepts and relatively compact decision tables.

1 Introduction

A decision table provides a simple means for concept representation. It represents a concept with labeled instances, each relating a set of attribute values to a class. Decision table decomposition is a method based on the "divide and conquer" approach: given a decision table, it decomposes it to a hierarchy of decision tables. The method aims to construct the hierarchy so that the new decision tables are less complex and easier to interpret than the original decision table.

The decision table decomposition method is based on function decomposition, an approach originally developed for the design of digital circuits [2]. The method iteratively applies a single decomposition step, whose goal is to decompose a function y = F(X) into y = G(A, H(B)), where X is a set of input attributes x1, ..., xn, and y is the class variable. F, G, and H are functions represented by decision tables, i.e., possibly incomplete sets of attribute-value vectors with assigned classes. A and B are nonempty subsets of input attributes such that A ∪ B = X. The functions G and H are developed by decomposition and are not predefined in any way. Such a decomposition also discovers a new intermediate concept c = H(B). Since the decomposition can be applied recursively on G and H, the result in general is a hierarchy of decision tables. As each decision table represents a concept, the result of decomposition can be regarded also as a concept hierarchy.

Each single decomposition step aims to minimize the joint complexity of G and H and executes the decomposition only if this is lower than the complexity of F. Moreover, it is of crucial importance for the algorithm to find such a partition of attributes X into sets A and B that yields G and H of the lowest complexity. The criteria that guide the selection of such a partition are called partition selection criteria.

Let us illustrate the decomposition by a simple example (Table 1). The decision table relates the input attributes x1, x2, and x3 to the class y, such that y = F(x1, x2, x3). There are three possible partitions of attributes that yield three different decompositions y = G1(x1, H1(x2, x3)), y = G2(x2, H2(x1, x3)), and y = G3(x3, H3(x1, x2)). The first two are given in Figure 1, and the comparison shows that:
- the decision tables of the decomposition y = G1(x1, H1(x2, x3)) are smaller than those of y = G2(x2, H2(x1, x3)),

- the new concept c1 = H1(x2, x3) uses only three values, whereas that for H2(x1, x3) uses five,

- it is hard to interpret decision tables G2 and H2, whereas by inspecting G1 and H1 it is easy to see that c1 = MIN(x2, x3) and y = MAX(x1, c1). This can be even more evident with the reassignment of c1's values: 1 to lo, 2 to med, and 3 to hi.

x1:  lo  lo  lo  lo  lo  lo   med med med med med med   hi  hi  hi  hi  hi  hi
x2:  lo  lo  med med hi  hi   lo  lo  med med hi  hi    lo  lo  med med hi  hi
x3:  lo  hi  lo  hi  lo  hi   lo  hi  lo  hi  lo  hi    lo  hi  lo  hi  lo  hi
y:   lo  lo  lo  med lo  hi   med med med med med hi    hi  hi  hi  hi  hi  hi

Table 1: An example decision table.

Figure 1: Two different decompositions of the decision table from Table 1.

The above comparison indicates that the decomposition y = G2(x2, H2(x1, x3)) yields more complex and less interpretable decision tables than the decomposition y = G1(x1, H1(x2, x3)).
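This interpretation can be checked mechanically. The following minimal Python sketch (ours, not part of the original article; the ordinal encoding and variable names are our own) enumerates Table 1 and confirms that y = MAX(x1, c1) with c1 = MIN(x2, x3) reproduces every instance:

```python
# A minimal sketch (ours): verify that Table 1 is reproduced by
# y = MAX(x1, c1) with the intermediate concept c1 = MIN(x2, x3).
from itertools import product

ORD = {"lo": 1, "med": 2, "hi": 3}           # ordinal encoding of the values

def f(x1, x2, x3):
    """The concept of Table 1, written via the intermediate concept c1."""
    c1 = min(ORD[x2], ORD[x3])               # c1 = MIN(x2, x3), i.e. H1
    return max(ORD[x1], c1)                  # y  = MAX(x1, c1), i.e. G1

# The y column of Table 1, row by row.
table_y = ["lo", "lo", "lo", "med", "lo", "hi",
           "med", "med", "med", "med", "med", "hi",
           "hi", "hi", "hi", "hi", "hi", "hi"]

rows = product(["lo", "med", "hi"], ["lo", "med", "hi"], ["lo", "hi"])
assert all(f(*r) == ORD[y] for r, y in zip(rows, table_y))
print("Table 1 agrees with y = MAX(x1, MIN(x2, x3))")
```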
The questions of interest are thus:

1. How do we measure the overall complexity of the original decision table and of the decomposed system?

2. Which are the criteria that can guide the single decomposition step to choose among possible decompositions?

3. How much information is contained within the hierarchical structure itself?

4. How does interpretability relate to the overall complexity of decision tables in the decomposed system? Is a less complex system also easier to interpret?

Some of these questions were already addressed in the area of computer-aided circuit design, where decomposition is used to find a circuit of minimal complexity that implements a specific tabulated Boolean function. There, the methods mostly rely on the complexity and partition selection criterion known as Decomposed Function Cardinality (DFC, see [21]). However, a question is whether this criterion can be used for the decomposition of decision tables of interest to machine learning, where attributes and classes usually take more than two values. Moreover, the main concern of Boolean function decomposition is the minimization of the digital circuit, leaving aside the question of comprehensibility and interpretability of the resulting hierarchy.

This article is organized as follows. The next section reviews related work on decision table decomposition with the emphasis on its use for machine learning. The decomposition algorithm to be used throughout the article is presented in section 3. Section 4 introduces two new partition selection criteria that are based on the information content of decision tables (DTIC) and on the cardinality of newly discovered concepts (CM). That section also discusses how DFC and DTIC may be used to estimate the overall complexity of derived decision tables, and shows how DTIC may be used to assess the information content of the discovered hierarchical structure itself. Section 5 experimentally evaluates the different criteria and complexity measures. Section 6 summarizes the results and concludes the article.

2 Related work

The decomposition approach to machine learning was used early by a pioneer of artificial intelligence, A. Samuel. He proposed a method based on a signature table system [22] and successfully used it as an evaluation mechanism for checkers-playing programs. This approach was later improved by Biermann et al. [3]. Their method, however, did not address the problem of deriving the hierarchy of concepts, which was supposed to be given by a domain expert.

A similar approach had been defined even earlier within the area of switching circuit design. In 1956, R. L. Ashenhurst reported on a unified theory of decomposition of switching functions [2]. The decomposition method proposed by Ashenhurst was used to decompose a completely specified truth table of a Boolean function to be then realized with standard binary gates. Thus, the method could construct concept hierarchies as well as their corresponding decision tables. Most of the other related work of those times is reported and reprinted by Curtis [8].

Recently, the Ashenhurst-Curtis approach was substantially improved by the research groups of M. A. Perkowski, T. Luba, and T. D. Ross. In [18], Perkowski et al. report on the decomposition approach for incompletely specified switching functions. Luba [12] proposed a method for the decomposition of multi-valued switching functions in which each multi-valued variable is encoded by a set of Boolean variables. A decomposition of k-valued functions was proposed by Files et al. [10]. The authors identify the potential usefulness of function decomposition for machine learning, and Goldman [11] indicates that the decomposition approach to switching function design might be termed knowledge discovery, since a function not previously foreseen might be discovered. From the viewpoint of machine learning, however, the main drawbacks of these methods are that they are mostly limited to Boolean functions and incapable of dealing with noise.

Feature discovery has been at large investigated by constructive induction [14]. Perhaps closest to function decomposition are the constructive induction systems that use a set of existing attributes and a set of constructive operators to derive new attributes. Several such systems are presented in [13, 19, 20].

Within machine learning, there are other approaches that are based on problem decomposition, but where the problem is decomposed by the expert and not by a machine. A well-known example is structured induction, developed by Shapiro [23]. His approach is based on a manual decomposition of the problem. For every intermediate concept, either a special set of learning examples is used or an expert is consulted to build a corresponding decision tree. In comparison with standard decision tree induction techniques, Shapiro's approach exhibits about the same classification accuracy with increased transparency and lower complexity of the developed models. Michie [15] emphasizes the important role structured induction will have in the future development of machine learning and lists several real problems that were solved in this way.
The work presented here is based on our own decomposition algorithm [25] in which we took the approach of Curtis [8] and Perkowski et al. [18], and extended it to handle multi-valued categorical attributes and functions. The algorithm was demonstrated to perform well in terms of generalization [26], discovery of relevant concept hierarchies [7], and feature construction [27] in fairly complex problem domains.

3 Decomposition algorithm

Let F be a decision table consisting of attribute-value vectors that map the attributes X = {x1, ..., xn} to the class y, so that y = F(X). A single decomposition step searches through all the partitions of attributes X into a free set A and a bound set B, such that A ∪ B = X and A ∩ B = ∅. Several partition selection (ψ) and complexity (θ) measures are introduced in the next section.

The algorithm that implements the single decomposition step and decomposes a decision table F to G and H is described in detail in [25]. Here, we illustrate it informally using the decision table from Table 1. For every attribute partition, the method constructs a partition matrix with the attributes of the bound set in columns and of the free set in rows. Each column in the partition matrix denotes the behavior of F for a specific combination of values of the bound attributes. The same columns can be represented with the same value of c. The number of different columns is equal to the minimal number of values for c to be used for decomposition. In this way, every column is assigned a value of c, and G and H are straightforwardly derived from such an annotated partition matrix. For each of the three partitions of our example decision table F, the partition matrices with the corresponding values of c are given in Figure 2.

The assignment of c's values is trivial when decision table instances completely cover the attribute space. When this is not the case, Wan and Perkowski [24] proposed an approach that treats missing decision table entries as "don't cares". Each partition matrix can then have several assignments of values for c. The problem of finding the assignment that uses the fewest values is then equivalent to optimal graph coloring. Graph coloring is an NP-hard problem, and the computation time of an exhaustive search algorithm is prohibitive even for small graphs. Instead, Wan and Perkowski suggested a heuristic Color Influence Method of polynomial complexity and showed that the method performed well compared to the optimal algorithm. Although the examples used in this article use decision tables that completely cover the attribute space, the complexity and partition measures introduced apply with no difference to incompletely covered cases as well.

        x2:  lo  lo  med med hi  hi           x1:  lo  lo  med med hi  hi
        x3:  lo  hi  lo  hi  lo  hi           x3:  lo  hi  lo  hi  lo  hi
x1=lo:       lo  lo  lo  med lo  hi    x2=lo:      lo  lo  med med hi  hi
x1=med:      med med med med med hi    x2=med:     lo  med med med hi  hi
x1=hi:       hi  hi  hi  hi  hi  hi    x2=hi:      lo  hi  med hi  hi  hi
c:           1   1   1   2   1   3     c:          1   2   3   4   5   5

        x1:  lo  lo  lo  med med med hi  hi  hi
        x2:  lo  med hi  lo  med hi  lo  med hi
x3=lo:       lo  lo  lo  med med med hi  hi  hi
x3=hi:       lo  med hi  med med hi  hi  hi  hi
c:           1   2   3   4   4   5   6   6   6

Figure 2: Partition matrices for Table 1 using three different partitions of attributes x1, x2, and x3.

The decomposition algorithm examines all decision tables in the evolving concept hierarchy and then applies a single decomposition step to the decision table and its partition that was evaluated as the most appropriate by ψ and that satisfies the complexity condition θ(F) > θ(G) + θ(H). If several partitions are scored equal, the algorithm arbitrarily selects one among those with the lowest number of elements in the bound set. The process is repeated until no decomposition is found that would satisfy the complexity condition.
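To make the single decomposition step concrete, here is a minimal Python sketch (our illustration, not HINT's actual implementation, which is written in C). It assumes a completely specified decision table, so that column labeling is trivial and no graph coloring is needed; the function and variable names are our own:

```python
# A minimal sketch (ours) of the single decomposition step for a completely
# specified decision table F: build the partition matrix for a free set A and
# a bound set B, then give identical columns the same value of c.
from itertools import product

def decompose(F, values, A, B):
    """F maps a dict of attribute values to a class; `values` maps each
    attribute name to its list of values. Returns (G, H, |c|)."""
    a_space = list(product(*(values[a] for a in A)))
    b_space = list(product(*(values[b] for b in B)))
    attrs = A + B
    # One column of the partition matrix per combination of bound values.
    column = {bv: tuple(F(dict(zip(attrs, av + bv))) for av in a_space)
              for bv in b_space}
    labels = {}                               # identical columns -> same c value
    H = {}
    for bv, col in column.items():
        H[bv] = labels.setdefault(col, len(labels) + 1)
    # G maps free-set values plus the value of c to the class: y = G(A, c).
    G = {av + (H[bv],): F(dict(zip(attrs, av + bv)))
         for av in a_space for bv in b_space}
    return G, H, len(labels)

vals = {"x1": ["lo", "med", "hi"], "x2": ["lo", "med", "hi"], "x3": ["lo", "hi"]}
ORD = {"lo": 1, "med": 2, "hi": 3}
F = lambda e: max(ORD[e["x1"]], min(ORD[e["x2"]], ORD[e["x3"]]))
G, H, c = decompose(F, vals, A=["x1"], B=["x2", "x3"])
print(c)   # 3 distinct columns, as in the first matrix of Figure 2
```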
We illustrate this stepwise decomposition using the CAR domain that is described in section 5. Figure 3 shows a possible evolving concept hierarchy obtained by decomposition. Each consecutive hierarchy is the result of a single decomposition step. Only the hierarchical structure without decision tables is shown.

Figure 3: Evolving concept hierarchy discovered by decomposition of the CAR decision table. Each consecutive hierarchy results from a single-step decomposition of its predecessor.

The overall time complexity of the decision table decomposition algorithm is polynomial in the number of examples, the number of attributes, and the maximal number of columns in partition matrices [26]. As the latter grows exponentially with the number of bound attributes, it is advantageous to limit the size of the bound set. In the experiments presented in section 5, however, the problems were sufficiently small to examine all possible bound sets.

The above decomposition algorithm was implemented in the C language as a part of the system called HINT (Hierarchy INduction Tool) [25]. HINT runs on several UNIX platforms, including HP/UX and SGI Iris.

4 Partition selection criteria and complexity measures

This section reviews one and introduces two new partition selection criteria. For each, it also defines the complexity measure and the corresponding complexity condition. Furthermore, two overall complexity measures for the hierarchy of decision tables are defined, and, finally, a measure for estimating the information content of the hierarchy itself is presented.

4.1 Partition selection criteria

4.1.1 Decomposed function cardinality

Decomposed function cardinality (DFC) was originally proposed by Abu-Mostafa [1] as a general measure of complexity and used in the decomposition of Boolean functions [21]. DFC is based on the cardinality of the function. Given a decision table F(X), the DFC-based complexity is defined as:

  θ_DFC(F) = ||X|| = ∏_{i=1..n} |x_i|    (1)

where |x_i| represents the cardinality of attribute x_i, i.e., the number of values it uses. The DFC partition selection criterion for the decomposition F(X) = G(A, c), c = H(B) is then:

  ψ_DFC(A|B) = θ_DFC(G) + θ_DFC(H) = |c| ||A|| + ||B||    (2)

The complexity condition using the above definitions is θ_DFC(F) > θ_DFC(G) + θ_DFC(H), or equivalently ||X|| > |c| ||A|| + ||B||.

For our example decision table (Table 1) and the corresponding partition matrices (Figure 2), the partition selection criteria are: ψ_DFC(x1|x2x3) = 9 + 6 = 15, ψ_DFC(x2|x1x3) = 15 + 6 = 21, and ψ_DFC(x3|x1x2) = 12 + 9 = 21. θ_DFC(F) is 18. The only partition that satisfies the DFC decomposition criterion is x1|x2x3.

DFC's ability to guide the decomposition of Boolean functions has been illustrated in several references including [21, 11]. For multi-valued logic synthesis, a DFC-guided decomposition was proposed in [10].
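The DFC arithmetic above is easy to reproduce. The following minimal Python sketch (ours) recomputes θ_DFC(F) and the three ψ_DFC values for Table 1, taking the column multiplicities |c| from Figure 2:

```python
# A minimal sketch (ours): DFC-based partition selection for Table 1.
from math import prod

card = {"x1": 3, "x2": 3, "x3": 2}            # attribute cardinalities

def dfc(cards):                               # theta_DFC = product of cardinalities
    return prod(cards)

def psi_dfc(A, B, c):                         # psi_DFC(A|B) = |c|*||A|| + ||B||
    return c * dfc(card[a] for a in A) + dfc(card[b] for b in B)

print(dfc(card.values()))                     # theta_DFC(F) = 18
print(psi_dfc(["x1"], ["x2", "x3"], c=3))     # 9 + 6 = 15  -> decomposable
print(psi_dfc(["x2"], ["x1", "x3"], c=5))     # 15 + 6 = 21 -> rejected
print(psi_dfc(["x3"], ["x1", "x2"], c=6))     # 12 + 9 = 21 -> rejected
```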
4.1.2 Information content of decision tables

Decision table information content (DTIC) is based on the idea of Biermann et al. [3], who counted the number of different functions that can be represented by a given signature table schema, i.e., a tree of concepts whose cardinality is predefined.

A decision table y = F(X) can represent |y|^||X|| different functions. Assuming the uniform distribution of functions, the number of bits needed to encode such a decision table is then:

  θ'_DTIC(F) = ||X|| log2 |y| bits    (3)

Note that for binary functions, where |y| = 2, this is equal to θ_DFC(F).

When decomposing y = F(X) to y = G(A, c) and c = H(B), we assign a single value from the set {1, 2, ..., |c|} to each of the columns of the partition matrix. But each of the values has to be assigned to at least one instance. In other words, from the |c|^||B|| different functions we have to subtract all those that use fewer than |c| values. The number of different functions with exactly |c| possible values is therefore N(|c|), where N is defined recursively as:

  N(x) = x^||B|| − Σ_{i=1..x−1} C(x, i) N(i),    N(1) = 1    (4)

where C(x, i) denotes the binomial coefficient. Furthermore, since the actual label (value of c) of the column is not important, there are |c|! such equivalent assignments and therefore |c|! equivalent decision tables H. A specific H therefore uniquely represents N(|c|)/|c|! functions with exactly |c| values, and the corresponding information content is:

  θ_DTIC(H) = log2 N(|c|) − log2(|c|!) bits    (5)

The DTIC partition selection criterion prefers decompositions with simple decision tables G and H, i.e., with low information content, so that:

  ψ_DTIC(A|B) = θ'_DTIC(G) + θ_DTIC(H)    (6)

The DTIC-based complexity condition is:

  θ'_DTIC(F) > θ'_DTIC(G) + θ_DTIC(H)    (7)

For Table 1, DTIC evaluates to: ψ_DTIC(x1|x2x3) = 20.76 bits, ψ_DTIC(x2|x1x3) = 27.68 bits, and ψ_DTIC(x3|x1x2) = 30.39 bits. θ'_DTIC(F) is 28.53 bits, and, in contrast to DFC, two partitions qualify for decomposition. Among these, as with DFC, the partition x1|x2x3 is preferred.

4.1.3 Column multiplicity

Column multiplicity (CM) is the simplest complexity measure introduced in this article and equals the cardinality of c (|c|), also referred to by Ashenhurst and Curtis as the column multiplicity number of the partition matrix [2, 8]. Formally,

  ψ_CM(A|B) = |c|    (8)

The idea for this measure came from practical experience with the DEX decision support system [5]. There, the hierarchical system of decision tables is constructed manually, and it has been found that decision tables with a small number of output values are easier to construct and interpret.

For our example, and similarly to DFC and DTIC, CM also selects the partition x1|x2x3, with ψ_CM = 3. The remaining two partitions have ψ_CM(x2|x1x3) = 5 and ψ_CM(x3|x1x2) = 6.

Unlike DTIC and DFC, CM values cannot simply be summed up to determine the joint complexity of a set of decision tables, which is needed to determine the complexity condition. Consequently, when we employ CM to guide the partition selection, we use DTIC to determine decomposability.

4.2 Complexity estimation for decision table hierarchy

Using DFC, the overall complexity of decision tables in the concept hierarchy is the sum of θ_DFC for each decision table. Similarly, for DTIC, the complexity estimation is again the sum of DTIC complexities of each of the decision tables, with the distinction that θ'_DTIC is used for the decision table at the root of the hierarchy and θ_DTIC for all other decision tables.

For example, consider the two concept hierarchies from Figure 1. Their overall complexities as measured by DFC are 15 and 21, respectively, and 20.76 bits and 27.68 bits as measured by DTIC. These measures confirm that the first decomposition is less complex and thus preferred to the second one. The original undecomposed decision table had DFC equal to 18 and DTIC equal to 28.53 bits. Therefore, in terms of DTIC both decompositions reduced the complexity, while using DFC this happened only with the first one.
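The DTIC figures can be reproduced from equations (3) to (6). In the following minimal Python sketch (ours), N implements the recursion of equation (4); the printed values match those reported above:

```python
# A minimal sketch (ours): DTIC quantities for Table 1 (equations 3-6).
from math import comb, log2, factorial

def N(x, B):            # functions over ||B|| points using exactly x values
    if x == 1:
        return 1
    return x ** B - sum(comb(x, i) * N(i, B) for i in range(1, x))

def theta1(size, y):    # theta'_DTIC = ||X|| * log2|y| bits      (equation 3)
    return size * log2(y)

def theta(c, B):        # theta_DTIC(H) = log2 N(|c|) - log2(|c|!) (equation 5)
    return log2(N(c, B)) - log2(factorial(c))

print(theta1(18, 3))                          # theta'_DTIC(F)     = 28.53 bits
print(theta1(3 * 3, 3) + theta(3, B=6))       # psi_DTIC(x1|x2x3) = 20.76 bits
print(theta1(3 * 5, 3) + theta(5, B=6))       # psi_DTIC(x2|x1x3) = 27.68 bits
print(theta1(2 * 6, 3) + theta(6, B=9))       # psi_DTIC(x3|x1x2) = 30.39 bits
```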
Note that the so-obtained DTIC complexity estimation is just an approximation of the exact complexity that would take into account the actual number of functions representable by a multi-level hierarchy. This is because DTIC is designed for a single table only and does not consider the reducibility [3] that occurs in multi-level hierarchies and effectively decreases the number of representable functions. Therefore, the estimated overall DTIC is an upper bound of the actual complexity.

4.3 Structure information content

Using DTIC, we can assess both the amount of information contained in the original decision table and that contained in the resulting decision tables constructed by decomposition. The difference of the two is the information contained in the hierarchical structure itself. We call this measure structure information content (SIC). The more informative the hierarchy, the less complex the resulting decision tables overall.

For the two decompositions in Figure 1, the corresponding structure information contents are 7.77 bits and 0.85 bits, respectively. Since the first SIC is considerably greater than the second one, the first structure is more informative and its decision tables more compact.

5 Experimental evaluation

To evaluate the proposed partition selection criteria and complexity measures, we used three artificial and three real-world domains that were selected so that their concept hierarchies were either known in advance or could have been easily anticipated. For each domain, the decomposition aimed to discover this hierarchy. For evaluation, we qualitatively assess the similarity of the two hierarchies and quantitatively compare them by using the proposed complexity measures.

Each of the six domains is represented with an initial decision table containing instances that completely cover the attribute space. Although the experiments could as well be done with sparser decision tables (see [25]), we wanted to focus in this article only on the discovery of concept hierarchies. Note that the proposed partition selection measures depend only on the cardinalities of attributes and concepts, and not on the actual number of instances in decision tables. Furthermore, we have shown in [26] that by increasing the problem space coverage by training instances, the discovered concepts converge to those from complete training sets.

The results of decompositions are shown as concept hierarchy structures, where, unless otherwise noted, the labels of intermediate concepts indicate the order in which they were discovered.

5.1 Artificial domains

Three artificial domains were investigated:

1. a Boolean function y = (x1 OR x2) AND x3 AND (x4 XOR x5),

2. a six-attribute palindrome function,

3. a three-valued function y = MIN(x1, AVG(x2, MAX(x3, x4), x5)).

For the first function, the initial decision table has 2^5 = 32 instances, θ_DFC = 32 and θ'_DTIC = 32 bits. While decomposition with DTIC and CM discovered the anticipated hierarchy, the DFC-guided decomposition terminated too soon because the complexity condition did not allow the decision tables to be decomposed any further (see Figure 4). Note that the overall DFC is the same for all discovered hierarchies, while the structure information content is higher for those discovered by DTIC and CM. The decision tables (not shown in the figure) were checked for interpretability and were found to represent the expected functions.

Figure 4: Decomposition of the decision table representing the function y = (x1 OR x2) AND x3 AND (x4 XOR x5), guided by DTIC and CM (left: DFC = 16, DTIC = 12.42 bits, SIC = 19.58 bits) and by DFC (right: DFC = 16, DTIC = 14.99 bits, SIC = 17.01 bits).
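As an illustration, the following minimal Python sketch (ours) generates the decision table of the first function, recomputes its initial complexity, and derives the structure information content of the left hierarchy in Figure 4 as the difference described in section 4.3:

```python
# A minimal sketch (ours): the first artificial domain and its measures.
from itertools import product
from math import log2

f = lambda x1, x2, x3, x4, x5: (x1 or x2) and x3 and (x4 != x5)
table = {xs: f(*xs) for xs in product([0, 1], repeat=5)}

theta_dfc = 2 ** 5                        # theta_DFC(F) = 32
theta_dtic = 2 ** 5 * log2(2)             # theta'_DTIC(F) = 32 bits
print(len(table), theta_dfc, theta_dtic)

# SIC (section 4.3): DTIC of the original table minus the overall DTIC of
# the decomposed tables, here for the DTIC/CM hierarchy of Figure 4 (left).
print(theta_dtic - 12.42)                 # SIC = 19.58 bits
```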
The second function y = PAL(x1, x2, ..., x6) returns 1 if the string x1...x6 is a palindrome and returns 0 otherwise, i.e., y = (x1 = x6) AND (x2 = x5) AND (x3 = x4). In the first experiment, six Boolean attributes x1...x6 were used. The initial decision table has θ_DFC = 64 and θ'_DTIC = 64 bits. Again, the decomposition with DFC stops sooner, and the domain favors the decomposition using CM and DTIC (see Figure 5). However, for both this and the previous case, a DFC-guided decomposition could discover the expected hierarchy if the corresponding complexity condition were relaxed to θ_DFC(F) ≥ θ_DFC(G) + θ_DFC(H).

Figure 5: Decomposition of the decision table representing the palindrome function, guided by DTIC and CM (left: DFC = 20, DTIC = 15.23 bits, SIC = 48.77 bits) and by DFC (right: DFC = 20, DTIC = 17.80 bits, SIC = 46.20 bits).

The same experiment was repeated with three-valued attributes x1...x6. This time, however, all three criteria lead to the same and anticipated concept hierarchy.

The third function y = MIN(x1, AVG(x2, MAX(x3, x4), x5)) uses ordinal attributes x1...x5 that can take the values 1, 2, and 3. While MIN and MAX have the standard interpretation, AVG computes the average of its arguments and rounds it to the closest integer. The initial decision table has θ_DFC = 243 and θ'_DTIC = 385.15 bits. The anticipated and discovered hierarchies are shown in Figure 6. Quite surprisingly, in all three cases the decomposition yields a hierarchy with a higher structure information content than expected, by introducing an additional five-valued intermediate concept. If this were removed, the discovered hierarchy and decision tables would have been the same as anticipated. It is also interesting to note that the hierarchies discovered using CM on one side and DFC or DTIC on the other are different but of the same complexity. This example illustrates that for a specific domain there may exist several optimal concept hierarchies with regard to complexity.

Figure 6: Decompositions of the function y = MIN(x1, AVG(x2, MAX(x3, x4), x5)): the anticipated hierarchy (left: DFC = 45, DTIC = 66.04 bits, SIC = 319.11 bits), the hierarchy discovered using CM (middle), and the hierarchy discovered using DFC and DTIC (right); the discovered hierarchies have DFC = 42, DTIC = 59.77 bits, and SIC = 325.38 bits. The complexity and information measures for the latter two decompositions are the same.
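The decision table of the third function can be generated in the same spirit (a minimal sketch, ours; note that with three arguments the average never falls exactly halfway between two integers, so the rounding rule is unambiguous):

```python
# A minimal sketch (ours): the third artificial domain,
# y = MIN(x1, AVG(x2, MAX(x3, x4), x5)), with ordinal values 1..3.
from itertools import product
from math import log2

def avg(*args):                           # average rounded to the nearest integer
    return round(sum(args) / len(args))

def f(x1, x2, x3, x4, x5):
    return min(x1, avg(x2, max(x3, x4), x5))

table = {xs: f(*xs) for xs in product([1, 2, 3], repeat=5)}
print(len(table))                         # theta_DFC(F) = 3^5 = 243
print(len(table) * log2(3))               # theta'_DTIC(F) = 385.15 bits
```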
5.2 DEX models

An area where concept hierarchies have been used extensively is decision support. There, the problem is to select an option from a set of given options so that it best satisfies the aims or goals of the decision maker. DEX [5] is a multi-attribute decision support system that has been extensively used to solve real-world decision making problems. DEX uses categorical attributes and expects the concept structure and the corresponding decision tables to be defined by the expert. The formalism used to describe the DEX model and its interpretation are essentially the same as with the concept hierarchies studied in this article. This makes decision models developed by DEX ideal benchmarks for the evaluation of decision table decomposition. In this article, we use the following three DEX models:

CAR: A model for evaluating cars based on their price and technical characteristics. This simple model was developed for educational purposes and is described in [4].

EMPLOY: This is a simplified version of the models that were developed with DEX for a common problem of personnel management: selecting the best candidate for a particular job. While the realistic models that were practically used in several mid- to large-size companies in Ljubljana and Sarajevo consisted of more than 40 attributes, the simplified version uses only 7 attributes and 3 intermediate concepts and was presented in [6].

NURSERY: This model was developed in 1985 to rank applications for nursery schools [17]. It was used during several years when there was excessive enrollment to these schools in Ljubljana, and the rejected applications frequently needed an objective explanation. The final decision depended on three subproblems: (1) occupation of parents and child's nursery, (2) family structure and financial standing, and (3) social and health picture of the family.

The CAR and NURSERY datasets are available from the UCI Machine Learning Repository [16].

The goal of this experiment was to reconstruct these DEX models from examples. The learning instances were derived from the original models, where for all combinations of input attributes the class was determined by the corresponding model. The examples were stated as attribute-value vectors, hiding from the decomposition method any underlying conceptual structure of the domain.

The discovered hierarchies are given in Figures 7, 8, and 9. In all cases, the decomposition guided by DFC, DTIC, and CM found the same hierarchical structures and corresponding decision tables. Using DFC and DTIC, the order in which new intermediate concepts were found was the same, but different from the one using CM. For example, in EMPLOY, the DFC- and DTIC-guided decomposition discovered c1 first, while, using CM, this concept was discovered as the last one.

All the discovered hierarchies have a higher structure information content than the original ones. Also, the overall complexity of decision tables is lower according to both DFC and DTIC. Most importantly, the discovered concept hierarchies are very similar to the original ones. In fact, if c3 were removed from CAR (making c4 directly dependent on lugboot, doors, and persons), the two hierarchies would be the same. The same applies to EMPLOY and NURSERY if c1 and c2 are removed, respectively. In other words, the decomposition found the same concept hierarchies as the original ones but additionally decomposed the decision tables for comfort (CAR), employ (EMPLOY), and struct+finan (NURSERY). In this way it obtained less complex decision tables.
Figure 7: The original concept hierarchy of CAR (left: DFC = 77, DTIC = 126.75 bits, SIC = 3329.25 bits) and the decomposition based on CM, DFC, and DTIC (right: DFC = 65, DTIC = 107.90 bits, SIC = 3348.10 bits).

Figure 8: The original concept hierarchy of EMPLOY (left: DFC = 91, DTIC = 145 bits, SIC = 35855 bits) compared to the hierarchy discovered by CM-, DFC-, and DTIC-guided decomposition (right: DFC = 85, DTIC = 128 bits, SIC = 35872 bits).

Figure 9: The original (left: DFC = 94, DTIC = 169.20 bits, SIC = 29922.99 bits) and discovered (right: DFC = 82, DTIC = 132.95 bits, SIC = 29959.24 bits) concept hierarchies for NURSERY, using the CM, DFC, and DTIC criteria.

The derived decision tables were compared to the original ones and found to be the same except in the names used for instance labels (the decomposition uses abstract labels, while the original decision tables use meaningful names). The only exceptions are the decision tables for tech and comfort in the CAR domain, where the decomposition succeeded in finding a more compact representation.

6 Conclusion

We investigated the appropriateness of three partition selection measures for decision table decomposition: decision table information content (DTIC) and column multiplicity (CM), introduced in this article, and decomposed function cardinality (DFC), which had already been used primarily for the decomposition of Boolean functions.

The experimental evaluation exposed a deficiency of DFC when decomposing a decision table that expresses a Boolean function. This may be alleviated by relaxing the DFC complexity condition. In more complex domains with multi-valued attributes, the decomposition guided by any of the proposed criteria discovered concept hierarchies that were very similar to those expected. Furthermore, the discovered hierarchies were equal to or even better than the anticipated ones in terms of the complexity of decision tables and structure information content. The order in which the intermediate concepts were discovered was the same for DFC and DTIC, but different for CM. A qualitative evaluation of the derived hierarchies reveals that, in general, the discovered decision tables represent meaningful and interpretable concepts.

Although less complex in definition and easier to compute, DFC and CM both stand well in comparison with the more complex partition selection measure DTIC. Also comparable is the utility of DFC and DTIC for assessing the complexity of the original and derived decision tables, although we have shown that the DFC-based measure performed worse on two Boolean functions. Overall, while DFC and DTIC have better theoretical foundations than the intuitive partition selection measure CM, the experimental evaluation does not indicate that any of these is to be strictly preferred over the others.

The decision table decomposition was primarily developed for switching circuit design.
However, experiments in non-trivial domains like DEX's strongly encourage further research and development of this method for machine learning and knowledge discovery. As the method has recently been extended to deal with continuous attributes [9] and noise [25], further research is needed to assess the quality of the corresponding partition selection criteria under these extensions.

References

[1] Y. S. Abu-Mostafa. Complexity in Information Theory. Springer-Verlag, New York, 1988.

[2] R. L. Ashenhurst. The decomposition of switching functions. Technical report, Bell Laboratories BL-1(11), pages 541-602, 1952.

[3] A. W. Biermann, J. Fairfield, and T. Beres. Signature table systems and learning. IEEE Trans. Syst. Man Cybern., 12(5):635-648, 1982.

[4] M. Bohanec and V. Rajkovič. Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl. Workshop on Expert Systems and their Applications, pages 59-78, Avignon, France, 1988.

[5] M. Bohanec and V. Rajkovič. DEX: An expert system shell for decision support. Sistemica, 1(1):145-157, 1990.

[6] M. Bohanec, B. Urh, and V. Rajkovič. Evaluating options by combined qualitative and quantitative methods. Acta Psychologica, 80:67-89, 1992.

[7] M. Bohanec, B. Zupan, I. Bratko, and B. Cestnik. A function decomposition method for development of hierarchical multi-attribute decision models. In Proc. 4th Conference of the International Society for Decision Support Systems (ISDSS-97), pages 503-514, Lausanne, Switzerland, July 1997.

[8] H. A. Curtis. A New Approach to the Design of Switching Functions. Van Nostrand, Princeton, N.J., 1962.

[9] J. Demšar, B. Zupan, M. Bohanec, and I. Bratko. Constructing intermediate concepts by decomposition of real functions. In M. van Someren and G. Widmer, editors, Proc. European Conference on Machine Learning, ECML-97, pages 93-107, Prague, April 1997. Springer.

[10] C. Files, R. Drechsler, and M. Perkowski. Functional decomposition of MVL functions using multi-valued decision diagrams. In International Symposium on Multi-Valued Logic, May 1997.

[11] J. A. Goldman. Pattern theoretic knowledge discovery. In Proc. the Sixth Int'l IEEE Conference on Tools with AI, 1994.

[12] T. Luba. Decomposition of multiple-valued functions. In 25th Intl. Symposium on Multiple-Valued Logic, pages 256-261, Bloomington, Indiana, May 1995.

[13] R. S. Michalski. A theory and methodology of inductive learning. In R. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, pages 83-134. Kaufmann, Palo Alto, CA, 1983.

[14] R. S. Michalski. Understanding the nature of learning: Issues and research directions. In R. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, pages 3-25. Kaufmann, Los Altos, CA, 1986.

[15] D. Michie. Problem decomposition and the learning of skills. In N. Lavrač and S. Wrobel, editors, Machine Learning: ECML-95, Notes in Artificial Intelligence 912, pages 17-31. Springer-Verlag, 1995.

[16] P. M. Murphy and D. W. Aha. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1994.

[17] M. Olave, V. Rajkovič, and M. Bohanec. An application for admission in public school systems. In I. Th. M. Snellen, W. B. H. J. van de Donk, and J.-P. Baquiast, editors, Expert Systems in Public Administration, pages 145-160.
Elsevier Science Publishers (North Holland), 1989.

[18] M. A. Perkowski et al. Unified approach to functional decompositions of switching functions. Technical report, Warsaw University of Technology and Eindhoven University of Technology, 1995.

[19] B. Pfahringer. Controlling constructive induction in CiPF. In F. Bergadano and L. De Raedt, editors, Machine Learning: ECML-94, pages 242-256. Springer-Verlag, 1994.

[20] H. Ragavan and L. Rendell. Lookahead feature construction for learning hard concepts. In Proc. Tenth International Machine Learning Conference, pages 252-259. Morgan Kaufmann, 1993.

[21] T. D. Ross, M. J. Noviskey, D. A. Gadd, and J. A. Goldman. Pattern theoretic feature extraction and constructive induction. In Proc. ML-COLT '94 Workshop on Constructive Induction and Change of Representation, New Brunswick, New Jersey, July 1994.

[22] A. Samuel. Some studies in machine learning using the game of checkers II: Recent progress. IBM J. Res. Develop., 11:601-617, 1967.

[23] A. D. Shapiro. Structured induction in expert systems. Turing Institute Press in association with Addison-Wesley Publishing Company, 1987.

[24] W. Wan and M. A. Perkowski. A new approach to the decomposition of incompletely specified functions based on graph-coloring and local transformations and its application to FPGA mapping. In Proc. of the IEEE EURO-DAC '92, pages 230-235, Hamburg, September 1992.

[25] B. Zupan. Machine learning based on function decomposition. PhD thesis, University of Ljubljana, April 1997. Available at http://www-ai.ijs.si/BlazZupan/papers.html.

[26] B. Zupan, M. Bohanec, I. Bratko, and J. Demšar. Machine learning by function decomposition. In D. H. Fisher, Jr., editor, Proc. Fourteenth International Conference on Machine Learning (ICML-97), pages 421-429, San Mateo, CA, 1997. Morgan Kaufmann.

[27] B. Zupan, M. Bohanec, J. Demšar, and I. Bratko. Feature transformation by function decomposition. IEEE Intelligent Systems & Their Applications, 13(2):38-43, March/April 1998.