Benchmarking Indicates Relevance of Multiple Knowledge

Matjaž Gams
Jožef Štefan Institute, Jamova 39, 61000 Ljubljana, Slovenia
Phone: +386 61 125 91 99, Fax: +386 61 219 677
E-mail: matjaz.gams@ijs.si

Keywords: artificial intelligence, multiple knowledge, multistrategy learning

Edited by: Anton P. Železnikar
Received: March 17, 1994   Revised: November 11, 1994   Accepted: November 15, 1994

Over the last 7 years, detailed measurements of available learning systems were performed on two real-life medical domains with the purpose of verifying the importance of multiple knowledge. The performance of the combined system GINESYS, consisting of an artificial intelligence and a statistical method, was analysed with and without multiple knowledge and by varying the number of learning examples, the amount of artificially added noise, and the impurity and error estimate functions. These measurements and those of other researchers indicate that multiple knowledge can provide essential improvements. Measurements also indicate that improvements over "one-level" or monostrategy knowledge representations are quite common in real-life noisy and incomplete domains.

1 Introduction

In easing the bottleneck of knowledge acquisition in expert systems (Harmon et al. 1988), automatic knowledge construction from examples has proven useful in many practical tasks. Quite often, examples are described in terms of attributes and their values, and each example belongs to a certain class. The task of the system is to induce concept descriptions from examples. The first systems were designed for exact domains like chess end-games and constructed trees (ID3 - Quinlan 1983) or lists of rules (AQ11 - Michalski & Larson 1983). But in many real-life domains (Gams & Karalič 1989), because of noise or incomplete descriptions (Manago & Kodratoff 1987), specialised mechanisms have to be applied. In noisy domains, longer rules (or longer branches in trees) perform better on learning examples, while truncated rules (pruned trees with shorter branches) perform better on unseen examples. On the basis of this principle, the second group of inductive systems emerged (CART - Breiman et al. 1984; AQ15 - Michalski et al. 1986; ASSISTANT - Kononenko 1985; CN2 - Clark & Niblett 1989; C4 - Quinlan 1987). Around five years ago the third group of systems began emerging (GINESYS - Gams 1988; 1989; LOGART - Cestnik, Bratko 1988; new CN2 - Clark & Boswell 1991), based on the explicit use of multiple knowledge (by "multiple knowledge" we refer to multiple models, multiple systems or multiple methods). Each of these groups of systems usually achieves better performance than the previous one. The better performance of multiple knowledge systems was especially noticeable in classification accuracy, and also in comprehensibility (although this is more difficult to measure), when compared to the other two groups. At the same time, their efficiency remained similar to that of systems in the second group.

With the measurements presented in this paper we give additional arguments for the successfulness of multiple knowledge by explicitly measuring the influence of the number of learning examples and of noise, as well as the influence of the error estimate and impurity functions. Benchmarking was performed on two often used domains - lymphography and primary tumor (Clark & Niblett 1989; Michalski et al. 1986; Cestnik & Bratko 1988; Gams 1988).
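For concreteness, the attribute-value representation assumed throughout can be sketched as follows; the attribute names, values and the class label are hypothetical illustrations and do not come from the lymphography or primary tumor data sets. The same layout is assumed in the later sketches in this paper.

    # A learning example: attribute-value pairs plus a class label.
    # The attribute names, values and class below are hypothetical.
    example = {
        "attributes": {"attribute_1": "value_a", "attribute_2": "value_b"},
        "class": "class_1",
    }

    # A learning set is a list of such examples; an induction system receives
    # the list and returns concept descriptions (a tree or a list of rules).
    learning_set = [example]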
Here we present results of benchmarking over a period of 7 years. Testing was always performed on the same two oncological domains. Altogether, around 20 systems were benchmarked. Our system GINESYS was constructed on the basis of the first benchmarking of around 10 systems in 1987, out of frustration that statistical systems regularly achieved better accuracy than single AI systems. GINESYS is described in Section 3, benchmarking in Sections 4 and 5.

2 Multiple Knowledge and Multistrategy Learning

Even the first expert systems like MYCIN (Shortliffe 1976), and most rule-based systems, already enabled a certain amount of multiplicity, i.e. redundancy, since rules can be more or less multiple. Newer systems like CN2 (Clark & Niblett 1989) or C4 (Quinlan 1987) contain a similar amount of redundancy, which is probably one of the reasons for their successful behaviour in noisy domains. In (Catlett & Jack 1987) it was reported that constructing a separate decision tree for each class with the same method as when constructing one decision tree for all classes significantly increased accuracy. A similar conclusion was derived by Clark & Boswell (1991) when constructing several lists of rules, and by Buntine (1989) when combining 5 decision trees with different roots.

In communications, the positive effect of using redundant bits has been known for decades, and even simple ID numbers in banking have additional digits in order to improve the robustness of the whole system. Theoretical aspects of redundancy in such cases are described e.g. in (Shannon & Weaver 1964).

In most every-day activities, people use multiple knowledge whenever there is any possibility of biasing (Utgoff 1986). For example, when hiring a new employee, one checks several reports which are basically multiple (e.g. biography, recommendations etc.). When making an important decision, humans often discuss possibilities in groups of relevant people. A council of physicians is consulted when dealing with difficult or important cases. One physician suffices for most normal activities, since one is substantially cheaper than a group.

It is commonly accepted that cross-checking several knowledge sources is generally better than using one source of knowledge alone. Humans are intrinsically multiple. They apply multiple strategies in every-day activities without paying much attention to that phenomenon. Therefore, machine and human multistrategy learning have a natural interrelationship and potential benefits in both directions.

In recent years there were several distinguished events related to multistrategy learning. Among them: a book edited by R. Michalski & G. Tecuci: Machine Learning, A Multistrategy Approach, Vol. IV, Morgan Kaufmann (1994), specialised international workshops on multistrategy learning organised by George Mason University, a special issue of Informatica (Tecuci 1993), and the IJCAI-93 workshop on integration of machine learning and knowledge acquisition (Tecuci, Kedar & Kodratoff 1993).

3 GINESYS

GINESYS (Generic INductive Expert SYstem Shell) is one of the oldest systems actively utilising multiple knowledge representations (Gams 1988). It consists of two systems (i.e. methods), one from AI and one from statistics. There were sensible reasons for combining methods from different fields. First of all, artificial intelligence methods enable the construction of knowledge bases which are typically very transparent and understandable; therefore, it was hoped that a combination would still be more understandable than statistical knowledge bases.
A statistical method was chosen on the basis of the hypothesis that the knowledge representations should be as different as possible. GINESYS utilises two different strategies on the basis of these two systems: the AI system constructs and consults lists of rules, and the statistical system multiplies probabilities according to the distribution of classes corresponding to each attribute of the tested example. Both single systems already implicitly utilise multiple knowledge - the AI part through a couple (typically 5) of rules attached to the main one, which are triggered when classifying, and the statistical part through combining probabilities relating to the value of each attribute of the tested example.

The AI part of GINESYS is named INESYS. It reimplements many of the mechanisms of the ID3 and AQ (CN2) family of algorithms. It was primarily designed as an attempt to fully simulate the family of ID3 and AQ inductive empirical learning systems (Gams & Lavrač 1987). Theoretically, it simulates N^M different algorithms, where M is the number of modules of the algorithm and N is the number of variations of each module (Gams 1989). The actual number of different variations of GINESYS can be estimated at several hundred.

INESYS constructs rules with a beam search over all possible combinations of attributes. In addition, it utilises several search-guiding and error-estimate functions such as informativity, the Gini index, the Laplacean error estimate and significance. Due to elaborate mechanisms for noise handling, INESYS typically constructs a small number of short rules, i.e. rules with a small number of attributes. For example, on average, 5.1 main rules with 1.4 attributes per rule were constructed in lymphography. In primary tumor, there were 11.0 main rules with 2.3 attributes per rule. Therefore, a typical rule had the form:

    if (A_i = V_ij) & (A_k = V_kl) then Distribution_n

where (A_i = V_ij) is a Boolean test of whether attribute i has value j, and Distribution_n is a class probability distribution corresponding to the condition part of the rule, i.e. a complex.

A general description of INESYS is:

    repeat
      construct Rule(s);
      add Rule(s) to d;
      L := L - ExamplesCoveredByRule
    until satisfiable d

where L is the set of learning examples, d is the constructed knowledge in the form of trees or lists of rules, and Rule(s) is one or many branches in a tree or one or many rules in the list of rules.

A procedural description of INESYS (the INESYS algorithm) follows, where d is the constructed knowledge in the form of an ordered list of ordered lists of rules, Star and BestRules are ordered lists of rules, L is the set of learning examples, and NP is a rule with an uninstantiated complex and the class probability distribution of L:

    d := ();                                  (* d is initialised *)
    repeat
      Star := (NP);
      BestRules := (NP);
      repeat
        for all Rule_i from Star generate all specialisations NewRule_i
          that do not fulfil the stopping criteria;
        Star := ();
        put into Star at most MAXSTAR of the best NewRule_i, evaluated by
          the user-defined impurity function;
        from the rules in BestRules and the significant NewRule_i choose
          the best MAXBEST rules, evaluated by the user-defined error
          estimate function, and put them into BestRules
      until Star is empty;
      add BestRules into d;
      L := L - examples covered by the best evaluated rule from BestRules
    until L is empty
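The covering loop and beam search above can be illustrated with the following heavily simplified Python sketch, using the example layout shown in the Introduction. The choice of the Gini index as impurity function and of the Laplacean error estimate, the parameter names maxstar, maxbest and min_cover, and the use of a minimum-coverage test in place of INESYS's stopping criteria and significance test are assumptions made for illustration only, not a reimplementation of INESYS.

    from collections import Counter

    def class_distribution(examples):
        """Relative class frequencies of a list of examples."""
        counts = Counter(e["class"] for e in examples)
        return {c: n / len(examples) for c, n in counts.items()}

    def covers(cpx, example):
        """A complex covers an example if every (attribute, value) test holds."""
        return all(example["attributes"].get(a) == v for a, v in cpx)

    def gini(examples):
        """Gini impurity of the class distribution (assumed impurity function)."""
        return 1.0 - sum(p * p for p in class_distribution(examples).values())

    def laplace_error(examples, n_classes):
        """Laplacean error estimate of a rule (assumed error estimate function)."""
        if not examples:
            return 1.0
        n = len(examples)
        n_maj = Counter(e["class"] for e in examples).most_common(1)[0][1]
        return (n - n_maj + n_classes - 1) / (n + n_classes)

    def induce_rules(learning_set, attributes, maxstar=5, maxbest=6, min_cover=2):
        """Simplified separate-and-conquer beam search in the spirit of INESYS."""
        n_classes = len({e["class"] for e in learning_set})
        knowledge, remaining = [], list(learning_set)
        while remaining:
            def covered(cpx):
                return [e for e in remaining if covers(cpx, e)]
            star, best = [frozenset()], [frozenset()]   # NP: uninstantiated complex
            while star:
                # Specialise every beam rule by one more (attribute, value) test;
                # rules covering fewer than min_cover examples are dropped
                # (a stand-in for the stopping criteria and significance test).
                candidates = {cpx | {(a, v)}
                              for cpx in star
                              for a in attributes if all(t[0] != a for t in cpx)
                              for v in {e["attributes"].get(a) for e in remaining}
                              if len(covered(cpx | {(a, v)})) >= min_cover}
                # Keep at most MAXSTAR candidates by the impurity function ...
                star = sorted(candidates, key=lambda c: gini(covered(c)))[:maxstar]
                # ... and at most MAXBEST rules overall by the error estimate.
                best = sorted(set(best) | set(star),
                              key=lambda c: laplace_error(covered(c), n_classes))[:maxbest]
            # The best rule becomes the main rule; the others act as confirmation rules.
            rules = [(cpx, class_distribution(covered(cpx))) for cpx in best]
            knowledge.append(rules)
            remaining = [e for e in remaining if not covers(best[0], e)]
        return knowledge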
In INESYS, the main improvement over existing rule-based systems are the rules attached to the main rule. The aim of these multiple rules is twofold. First, to give the user more rules and thus more opportunities to analyse the laws of the domain. Second, to improve classification accuracy by cross-checking the matched rule with confirmation rules. This mechanism already enables the use of multiple knowledge to a certain degree:

    if Complex1 then Class1
       (Complex11 then Class11
        Complex1R then Class1R)
    else if Complex2 then Class2
       (Complex21 then Class21
        Complex2R then Class2R)

Classification in INESYS starts by sequentially checking the main rules. When the first main rule matches a new example, the corresponding multiple rules that match the new example add their class probability distributions according to the formula for the union of independent events:

    p_12 = p_1 + (1 - p_1) * p_2.

Probabilities are multiplied by error estimates in order to calibrate the effect of rules with different credibility, and finally normalised. There are two threshold parameters that present a heuristic estimate of the goodness of classification by a rule: the smallest necessary percentage of the majority class (MINACC) and the smallest difference between the percentage of the majority class and the second to majority class (MINDIFF). Each constructed rule in GINESYS has to satisfy both conditions. Parameter MINDIFF additionally affects the classification process in the sense that the class probability distribution of the combined main and confirmation rules must satisfy it.

The second method in GINESYS is an approximation of the Bayesian classifier which assumes independence of attributes. It is often referred to as "naive Bayes" (Good 1950), in this paper also "Bayes". Naive Bayes constructs all possible rules with only one attribute in the complex. Therefore, the form of these rules is:

    if (A_i = V_ij) then Distribution_n.

The classification schema is as follows: all rules that match a new example are taken into consideration. The probability of each class c is computed by the following formula:

    P(c|A) = Pa(c) * (P(A_1|c) / Pa(A_1)) * ... * (P(A_v|c) / Pa(A_v))        (Eq. 1)

where P(c|A) denotes the probability of class c given the attributes and values A of the tested example, Pa(c) denotes the a priori probability of class c, P(A_i|c) the probability that attribute A_i has the same value as in the classified example given the class c, Pa(A_i) the same as before but regardless of class, and v is the number of attributes. By calculating the probabilities of all classes by (Eq. 1), a class probability distribution is obtained. Therefore, although naive Bayes constructs rules similar to those of INESYS, in the process of classification all attributes are considered in Bayes and on average only around two in INESYS.
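A minimal sketch of the classification step in (Eq. 1), stated over the example layout used earlier, is given below. The function name and the handling of values never seen in the learning set are assumptions, and the zero-frequency correction discussed in Section 5 is deliberately omitted here.

    from collections import Counter

    def naive_bayes_probabilities(example, learning_set):
        """Class probabilities following (Eq. 1), estimated by relative frequencies."""
        n = len(learning_set)
        class_counts = Counter(e["class"] for e in learning_set)
        scores = {}
        for c, n_c in class_counts.items():
            score = n_c / n                                   # Pa(c)
            for attr, value in example["attributes"].items():
                n_av = sum(1 for e in learning_set
                           if e["attributes"].get(attr) == value)
                n_avc = sum(1 for e in learning_set
                            if e["class"] == c and e["attributes"].get(attr) == value)
                # P(A_i|c) / Pa(A_i); a value never seen at all contributes factor 0.
                score *= (n_avc / n_c) / (n_av / n) if n_av else 0.0
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()} if total else scores

The predicted class is simply the one with the highest value in the returned distribution.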
Cooperation between the AI and the statistical system is relevant only when they propose different classes. In that case, the goodness of the triggered rules in INESYS is estimated by the simple heuristics mentioned above. If the goodness of the combined rules exceeds the value of a given threshold (parameter MINDIFF), the classification by INESYS is adopted. Otherwise, the classification by naive Bayes prevails. In other words: if the class probability distribution of the combined rules is estimated as unreliable, the statistical method is called in as a supervisor to decide which class is estimated as the most probable.

The combining schema is based on the following reasoning: when multiple rules confirm the main ones, classification is very likely to be correct. If a significant disagreement occurs, then the list of rules is not credible and the other method, using a different knowledge representation, should be consulted. It was expected that the short rules constructed by INESYS would be more successful when they have high confidence in their prediction, and that the approximation of the Bayesian classifier would be more successful when dealing with difficult cases, where truncated rules capturing the main and most important laws of the domain are not predicting with great certainty.

4 Benchmarking

Since 1987, systematic measurements have been performed on two oncological domains, lymphography and primary tumor. Data were obtained from real patients at the Oncological Institute Ljubljana (Kononenko 1985; Cestnik & Bratko 1988). Unknown values of attributes were replaced by the most common values regarding the class.

4.1 Domain Description

Basic statistics of the whole set of data are:

LYMPHOGRAPHY
  18 attributes
  2 - 8 (average 3.3) values per attribute
  9 classes
  150 examples
  distribution: 2 1 12 8 69 53 1 4 0
  all examples differ even if one attribute is deleted

PRIMARY TUMOR
  17 attributes
  2 - 3 (average 2.2) values per attribute
  22 classes
  339 examples
  distribution: 84 20 9 14 39 1 14 6 0 2 28 16 7 24 2 1 10 29 6 2 1 24
  75 examples in the data set have another example with the same values of
  attributes and a different class; if we delete one attribute, this number is:
  114 111 81 122 84 75 93 79 97 91 77 83 76 77 79 94 94

4.2 Benchmarked Systems

On the benchmark domains, around 20 AI and statistical systems were compared over more than half a decade. All the systems were given the same set of 10 random distributions of data, each time taking 70% of the data for learning and 30% of the data for testing. Results of relevant systems are presented in Table 1; the row in the middle of the table divides multiple and single systems, i.e. those that combine many rules during one classification and those that use only one rule.

                       LYMPHOGRAPHY                 PRIMARY TUMOR
  SYSTEM         class.acc. no.rules no.att.  class.acc. no.rules no.att.
  GINESYS*           70.5      5.1      7         52.2     11.0     25
  GINESYS            70.5      5.1      7         52.0     11.0     25
  BAYES              68.6     56.0     56         50.1     37.0     37
  CN2-new1           68.7                         50.3
  GB*                67.4      5.1      7         47.6     11.0     25
  CN2-new1'          65.6                         46.9
  NEAREST NEIG.      72.9                         40.4
  C4.5-rules         64.7                         38.2
  C4.5-trees-u       63.1                         48.9
  C4.5-trees-p       66.7                         48.8
  CN2-like1          67.3      4.8      8         48.7     11.4     27
  CN2-like1'         66.1      5.0      6         45.6     10.8     22
  ID3-like           61.8     25.0    110         48.7     28.6    129
  CN2-like2          66.8     10.8     21         45.7     19.3     70
  CN2-like2'         65.0      9.4     16         46.2     19.4     68
  AQ-like1           60.6      7.0     80         48.8     16.0    423
  AQ-like2           55.2      7.0     80         32.0     16.0    423

Table 1: Benchmarking systems on two oncological domains.
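The accuracies in Table 1 (and in the later tables) are averages over the splitting protocol described above. A minimal sketch of that protocol is shown below, assuming a hypothetical learner object with fit and classify methods rather than any of the actual benchmarked systems.

    import random

    def benchmark(learner, examples, n_runs=10, learn_fraction=0.7, seed=0):
        """Average accuracy over n_runs random learn/test splits (70%/30% by default)."""
        rng = random.Random(seed)
        accuracies = []
        for _ in range(n_runs):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            cut = int(learn_fraction * len(shuffled))
            learn, test = shuffled[:cut], shuffled[cut:]
            learner.fit(learn)                 # hypothetical learner interface
            correct = sum(learner.classify(e) == e["class"] for e in test)
            accuracies.append(correct / len(test))
        return sum(accuracies) / len(accuracies)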
GINESYS* is a version of GINESYS using "negation" multiple rules, which try to confront the main rule if possible. GB* is GINESYS* without the statistical method, i.e. INESYS with the functions G and B. The first nearest neighbour algorithm classifies with the class of the nearest neighbour, where distance is measured by the number of attributes with different values. BAYES is an approximation of the Bayesian classifier using the assumption that attributes are independent. ID3-like is a version of the ASSISTANT system using cross-validation pruning. CN2-like systems are different modifications of the CN2 algorithm, and CN2-new systems are the latest versions. C4.5-rules constructs rules, C4.5-trees-u unpruned trees, and C4.5-trees-p pruned trees. AQ-like systems are modifications of the AQ15 system.

Classification accuracy (column 1 in each domain in Table 1) was measured as the average percentage of correct classifications in ten test runs. The second column in each domain represents the average number of rules in a rule list or branches in the tree. The third column is the product of the number of rules (branches), the average length of a rule (branch) and the number of internal disjunctions.

The relations between the systems are similar to those observed in other measurements (Clark & Niblett 1989; Rendell et al. 1987; Rendell et al. 1988). Systems of the AQ family usually achieve lower classification accuracy than CN2 or ASSISTANT, while ASSISTANT and CN2 achieve similar classification accuracy. AQ-like1 represents an estimate of the upper possible classification accuracy of the rules constructed by the AQ-like system. BAYES achieved better results than the other systems except GINESYS. The nearest neighbour algorithm seems to be very domain dependent. GINESYS achieved the best average classification accuracy over both domains.

AQ-like systems construct more complex rules than the other systems. However, the third column might be misleading for tree-constructing algorithms like ID3-like because it represents a tree as a list of separated branches. GINESYS* and GINESYS are measured only by the main rules and not by the multiple ones. On the other side, from the results in Table 1 it follows that systems like GINESYS and CN2 construct a smaller number of shorter main rules, while AQ-like systems construct more complex rules.

The efficiency of the benchmarked algorithms was also analysed. AQ systems are about an order of magnitude slower than ASSISTANT, CN2 and GINESYS, and these are about an order of magnitude slower than BAYES. The results are similar to other measurements, keeping in mind that our versions of CN2 and GINESYS use a data compression mechanism which speeds up the algorithm roughly five times. GINESYS PC, another version of GINESYS, runs on IBM PC computers and is available as free scientific software.

4.3 Varying Impurity and Error Estimate Functions

In order to verify whether the improvements in GINESYS were caused by multiple knowledge or by domain-dependent parameters, several parameters were varied, and the functions were the first among them. GINESYS uses two different groups of functions: informativity functions and error estimate functions. Informativity functions strategically guide the search by trying to determine the amount of impurity. Error estimate functions try to estimate the classification error. Four functions were used in all 16 possible combinations in each domain. The classification accuracy of GINESYS with (GINESYS**) and without (INESYS**) top-level multiple knowledge was compared. In Table 2 we present only the best three combinations of INESYS** in both domains. The four functions used were: I - informativity (Quinlan 1986); A - % of majority class; G - Gini index (Breiman et al. 1984); B - Laplacean error estimate (Niblett & Bratko 1986). The first letter denotes the impurity function and the second letter the error estimate function.

           LYMPHOGRAPHY                       PRIMARY TUMOR
  FUNCTIONS  INESYS**  GINESYS**      FUNCTIONS  INESYS**  GINESYS**
     AB        68.4      69.7            BA        48.3      52.3
     GB        67.4      69.9            BG        48.1      51.8
     BB        66.4      70.8            GB        47.6      52.0
     BG        62.6      68.4            AB        46.6      51.3
     BA        62.4      68.4            BB        46.4      52.5

Table 2: Accuracy under different impurity and error estimate functions.
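For reference, common textbook formulations of the four functions are sketched below, stated over a class probability distribution (and, for the Laplacean estimate, over rule coverage counts). They illustrate the kinds of functions that were varied and are only assumed to approximate the exact formulations used in GINESYS.

    import math

    def informativity(distribution):
        """I - informativity: entropy of the class distribution (Quinlan 1986)."""
        return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

    def majority(distribution):
        """A - percentage of the majority class."""
        return max(distribution.values())

    def gini_index(distribution):
        """G - Gini index of the class distribution (Breiman et al. 1984)."""
        return 1.0 - sum(p * p for p in distribution.values())

    def laplace_error(n_covered, n_majority, n_classes):
        """B - Laplacean error estimate of a rule (in the style of Niblett & Bratko 1986)."""
        return (n_covered - n_majority + n_classes - 1) / (n_covered + n_classes)

    # A two-letter code such as "GB" names a combination: the first letter is the
    # impurity function guiding the beam search, the second the error estimate
    # used to rank candidate rules.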
The measurements presented in Table 2 indicate that the Laplacean error estimate is one of the most successful functions used for impurity or error estimates. Informativity is unexpectedly not present in the best three combinations. The default functions for the GINESYS systems (GB) were taken in advance from the literature (Breiman et al. 1984; Niblett & Bratko 1986).

4.4 Varying Percentage of Learning Examples

The benchmarks in Sections 4.2 and 4.3 were performed on 10 distributions of data, each time taking 70% of the data for learning and 30% of the data for testing. In Tables 3 and 4 we varied the percentage of learning data from 20% to 80% and used the remaining data for testing. A graphical representation of the data in Table 4 is shown in Figure 1. Systems in Figure 1 are denoted as in column 1 of Table 4. ASSIST 0 is ASSISTANT without pruning and INESYS is GINESYS without the statistical method.

  LYMPHOGRAPHY                % OF LEARNING EXAMPLES
  SYSTEM          20%    30%    40%    50%    60%    70%    80%
  GINESYS        52.8   58.2   63.1   63.7   60.1   70.5   75.3
  BAYES          52.8   59.3   60.8   61.2   58.2   68.6   72.1
  INESYS         39.2   51.7   54.1   62.6   59.0   67.4   74.3
  ASSISTANT      53.9   60.5   57.9   57.5   55.2   62.1   65.2
  ASSIST 0       53.2   60.7   57.4   57.8   55.9   62.4   66.8

Table 3: Accuracy in lymphography at different percentages of learning data.

  P. TUMOR                    % OF LEARNING EXAMPLES
  SYSTEM          20%    30%    40%    50%    60%    70%    80%
  GINESYS (G)    41.9   44.6   48.1   49.0   48.1   52.0   52.3
  BAYES (B)      41.8   45.2   47.5   48.0   47.2   50.1   50.3
  INESYS (I)     26.9   35.6   33.8   43.5   41.2   45.9   46.7
  ASSISTANT (A)  39.8   43.5   43.5   45.9   44.3   47.9   49.2
  ASSIST 0 (A0)  39.6   41.6   39.9   41.1   39.6   41.3   41.7

Table 4: As in Table 3, but for the primary tumor domain.

Figure 1: Graphical representation of the data in Table 4. On the x-axis is the percentage of learning data and on the y-axis is classification accuracy.

Probably the main reason for the disproportionately low classification accuracy of INESYS with a small number of learning examples are the functions, which work well only with several tens of examples. But even then there are some cases when INESYS classifies better than BAYES. The combining mechanism usually decides well when to choose the right method. The performance of INESYS increases with the number of learning examples, and the gain of GINESYS over BAYES also proportionally increases. In lymphography, ASSISTANT prunes the tree by approximately 50% and achieves very similar classification accuracy to ASSIST 0. In primary tumor, the pruned tree constructed by ASSISTANT is roughly 4 times smaller than the tree of ASSIST 0, which besides constructing more complex trees also achieves lower classification accuracy.

The improvement of GINESYS over the better of its two subparts was typically around 1-2%, leading to the conclusion that the combining mechanism performed well when changing the number of learning examples.

4.5 Varying Additional Noise

Noise was introduced into the lymphography and primary tumor domains, to attributes and classes in the learning and test examples. For example, 1% of noise means that, on average, each hundredth attribute value and each hundredth class label was randomly changed in learning and test data.
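A sketch of this noise injection procedure is given below. The parameters attribute_values (a mapping from each attribute to its list of legal values) and classes (the list of class labels) are hypothetical helpers, and details such as whether a replacement may coincide with the original value are assumptions, since the experiments do not specify them beyond the description above.

    import random

    def add_noise(examples, level, attribute_values, classes, seed=0):
        """Return a copy of the examples with `level` (e.g. 0.05 for 5%) noise added.

        Each attribute value and each class label is, with probability `level`,
        replaced by a random legal value; replacement by the same value is allowed.
        """
        rng = random.Random(seed)
        noisy = []
        for e in examples:
            attrs = dict(e["attributes"])
            for a in attrs:
                if rng.random() < level:
                    attrs[a] = rng.choice(attribute_values[a])
            cls = rng.choice(classes) if rng.random() < level else e["class"]
            noisy.append({"attributes": attrs, "class": cls})
        return noisy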
Average results of 10 tests (see Section 4.2) are presented in Tables 5 and 6, and in Figure 2.

  LYMPHOGRAPHY                % OF ADDITIONAL NOISE
  SYSTEM           0%     1%     5%    10%    20%    35%    50%
  GINESYS        70.5   65.3   63.7   53.1   43.8   28.9   21.1
  BAYES          68.6   65.8   61.7   51.1   41.8   28.0   20.7
  INESYS         67.4   63.4   59.1   53.0   41.4   30.3   25.4
  ASSISTANT      62.1   60.2   52.8   34.1   33.3   23.4   18.4
  ASSIST 0       62.4   60.5   51.8   41.6   29.9   23.5   17.6

Table 5: The influence of additional noise - lymphography.

  P. TUMOR                    % OF ADDITIONAL NOISE
  SYSTEM           0%     1%     5%    10%    20%    35%
  GINESYS (G)    52.0   50.6   42.6   35.2   23.5   13.8
  BAYES (B)      50.1   47.8   40.3   33.5   23.6   13.9
  INESYS (I)     45.9   43.5   36.2   30.7   20.0   16.1
  ASSISTANT (A)  47.9   44.9   39.4   30.5   16.7    8.4
  ASSIST 0 (A0)  41.3   39.1   32.4   25.3   14.5    8.7

Table 6: The influence of additional noise - primary tumor.

Figure 2: Graphical representation of the data in Table 6. On the x-axis is the percentage of additional noise and on the y-axis is classification accuracy.

When the amount of noise increases, the performance of INESYS relatively improves and achieves even better classification accuracy than GINESYS. As expected, in a very noisy situation a small number of short rules performs best. A similar effect is noticeable in the lymphography domain, especially compared to ASSISTANT, and is probably connected to the fact that ASSISTANT constructs a tree with several tens of leaves while INESYS constructs from 2 to 5 rules. With a growing amount of noise, the gain of GINESYS slowly decreases but remains around 2% as long as any rule of INESYS can be trusted as a meaningful one.

5 New Measurements

In further attempts to verify the results presented in Section 4, GINESYS and the benchmark data were, around five years ago, sent to over 50 laboratories and declared to be freely available for scientific purposes. The obtained answers can be clustered into two groups: several laboratories benchmarked systems on the proposed two domains, or at least approved of the approach. On the other hand, there were some researchers who considered the proposed benchmarking of classification accuracy a numerical measurement belonging to statistics. In their opinion, artificial intelligence methods should be evaluated mainly at the level of ideas. Indeed, measuring only classification accuracy does not consider several important advantages of artificial intelligence, e.g. the transparency of the constructed knowledge base or the comprehensibility of classifications. However, in the last two years we have observed a constant shift in a direction which accepts such verifications as crucial in evaluating quality.

In 1990 we received the first, and so far only, report of a system, NAIVE BAYES* (Cestnik 1992), which achieved better accuracy than GINESYS in both domains (54.1% in primary tumor and 70.9% in lymphography). The improvement is based on a correction of a weakness of NAIVE BAYES which appears whenever there is a gap in the data, meaning that there is no learning example with the particular value of the attribute. Then one factor in the product becomes 0 and the resulting product (Eq. 1) becomes 0. This was already observed in (Gams & Drobnič 1988; Gams et al. 1991), where ε was used instead of 0. In NAIVE BAYES*, the Laplacean estimate is introduced as a correction instead of ε.
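The gap problem and its correction can be illustrated as follows. The Laplacean smoothing of the conditional probabilities shown here is one common formulation and is only assumed to approximate the exact correction used in NAIVE BAYES* (Cestnik 1992); the parameter n_values (a mapping from each attribute to its number of possible values) is a hypothetical helper.

    from collections import Counter

    def laplace_conditional(attr, value, cls, learning_set, n_values):
        """P(A_i = value | class) with a Laplacean correction: (n_ivc + 1) / (n_c + k_i),

        where n_ivc counts learning examples of the class with this attribute value,
        n_c counts examples of the class, and k_i is the number of values of the
        attribute.  The estimate is never exactly zero, so gaps in the data no
        longer force the whole product in (Eq. 1) to zero.
        """
        n_c = sum(1 for e in learning_set if e["class"] == cls)
        n_ivc = sum(1 for e in learning_set
                    if e["class"] == cls and e["attributes"].get(attr) == value)
        return (n_ivc + 1) / (n_c + n_values[attr])

    def corrected_bayes_scores(example, learning_set, n_values):
        """Class scores proportional to Pa(c) * prod_i P(A_i|c).

        The class-independent 1/Pa(A_i) factors of (Eq. 1) cancel in the final
        normalisation and are therefore omitted.
        """
        n = len(learning_set)
        scores = {}
        for cls, n_c in Counter(e["class"] for e in learning_set).items():
            score = n_c / n
            for attr, value in example["attributes"].items():
                score *= laplace_conditional(attr, value, cls, learning_set, n_values)
            scores[cls] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}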
The reported improvements enabled additional experiments in trying to construct a multiple system achieving even better classification accuracy. In the first attempt, NAIVE BAYES* was directly embedded into GINESYS, but the observed classification accuracy was lower than that of NAIVE BAYES*. Obviously, a smaller number of stronger rules had to be constructed, since NAIVE BAYES* achieved significantly better classification accuracy than GB. Several parameters in GINESYS deal with rules, such as significance (Kalbfleisch 1979), the modified Laplacean error estimate (Niblett & Bratko 1986) or MINDIFF and MINACC. In the second attempt, MINDIFF was set to 0.5 instead of the previous 0.3, and MINACC to 0.7. Consequently, GINESYS90 achieved an additional 0.8% increase in primary tumor and 1.3% in lymphography over NAIVE BAYES*. Later it was found that these values of MINACC and MINDIFF belong to the set of optimal combinations, as can be observed in Tables 8 and 9. The updated versions of NAIVE BAYES and GINESYS achieve the best two classification accuracies (compare Table 1 and Table 7). The percentage of corrections by NAIVE BAYES was 8% in lymphography and 27% in primary tumor in GINESYS and, correspondingly, 25% and 45% in GINESYS90.

The new values of the parameters MINDIFF and MINACC force GINESYS90 to construct a smaller number of longer rules. Also, rules are usually corrected by NAIVE BAYES* roughly twice as often as in GINESYS. To a great extent, this is due to the increased average number of classifications performed by the null or uninstantiated rule, i.e. the last rule in a rule list. This number increased from 9.2 to 15.9 in lymphography (45 classifications), and from 18.0 to 55.1 in primary tumor (102 classifications). Understandably, the last uninstantiated rule is always considered unreliable in GINESYS and GINESYS90. But in the INESYS and INESYS90 algorithm, the classification is still performed by the corresponding null-rule class distribution, which is typically only slightly better than the default rule. Therefore, it is understandable that on average accuracy of

                      LYMPHOGRAPHY                 PRIMARY TUMOR
  SYSTEM        class.acc. no.rules no.att.  class.acc. no.rules no.att.
  INESYS90          63.7      3.8      7         36.3      6.9     19
  NAIVE BAYES*      70.9     56.0     56         54.1     37.0     37
  GINESYS90         72.2     59.6     63         54.9     44.3     56

Table 7: Accuracy, number of rules and number of all attributes.

LYMPHOGRAPHY ACC. .1 .2 .3 .4 .5 .6 .7 .8 .9 .9 71.0 + .8 72.0