Benchmarking Indicates Relevance of Multiple Knowledge

Matjaž Gams
Jožef Štefan Institute, Jamova 39, 61000 Ljubljana, Slovenia
Phone: +386 61 125 91 99, Fax: +386 61 219 677
E-mail: matjaz.gams@ijs.si

Keywords: artificial intelligence, multiple knowledge, multistrategy learning

Edited by: Anton P. Železnikar
Received: March 17, 1994   Revised: November 11, 1994   Accepted: November 15, 1994

Over the last 7 years, detailed measurements of available learning systems were performed on two real-life medical domains with the purpose of verifying the importance of multiple knowledge. The performance of the combined system GINESYS, consisting of an artificial intelligence and a statistical method, was analysed with and without multiple knowledge and by varying the number of learning examples, the amount of artificially added noise, and the impurity and error estimate functions. These measurements and those of other researchers indicate that multiple knowledge can provide essential improvements. Measurements also indicate that improvements over "one-level" or monostrategy knowledge representations are quite common in real-life noisy and incomplete domains.

1 Introduction

In easing the bottleneck of knowledge acquisition in expert systems (Harmon et al. 1988), automatic knowledge construction from examples has proven useful in many practical tasks. Quite often, examples are described in terms of attributes and their values, and each example belongs to a certain class. The task of the system is to induce concept descriptions from examples. The first systems were designed for exact domains like chess end-games and constructed trees (ID3 - Quinlan 1983) or lists of rules (AQ11 - Michalski & Larson 1983). But in many real-life domains (Gams & Karalič 1989), because of noise or incomplete descriptions (Manago & Kodratoff 1987), specialised mechanisms have to be applied. In noisy domains, longer rules (or longer branches in trees) perform better on learning examples, while truncated rules (pruned trees with shorter branches) perform better on unseen examples. On the basis of this principle, the second group of inductive systems emerged (CART - Breiman et al. 1984; AQ15 - Michalski et al. 1986; ASSISTANT - Kononenko 1985; CN2 - Clark & Niblett 1989; C4 - Quinlan 1987). Around five years ago the third group of systems began emerging (GINESYS - Gams 1988; 1989; LOGART - Cestnik, Bratko 1988; new CN2 - Clark & Boswell 1991), based on the explicit use of multiple knowledge (by "multiple knowledge" we refer to multiple models, multiple systems or multiple methods). Each of these groups of systems usually achieves better performance than the previous one. The better performance of multiple knowledge systems was especially noticeable in classification accuracy, and also in comprehensibility (although this is more difficult to measure), when compared to the other two groups. At the same time, their efficiency remained similar to that of systems in the second group.

With the measurements presented in this paper we give additional arguments for the successfulness of multiple knowledge by explicitly measuring the influence of the number of learning examples and of noise, as well as the influence of the error estimate and impurity functions. Benchmarking was performed on two often used domains - lymphography and primary tumor (Clark & Niblett 1989; Michalski et al. 1986; Cestnik & Bratko 1988; Gams 1988).
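For concreteness, the attribute-value representation assumed throughout can be sketched as follows; the attribute names, values and the class label are hypothetical illustrations and do not come from the lymphography or primary tumor data sets. The same layout is assumed in the later sketches in this paper.

    # A learning example: attribute-value pairs plus a class label.
    # The attribute names, values and class below are hypothetical.
    example = {
        "attributes": {"attribute_1": "value_a", "attribute_2": "value_b"},
        "class": "class_1",
    }

    # A learning set is a list of such examples; an induction system receives
    # the list and returns concept descriptions (a tree or a list of rules).
    learning_set = [example]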
Here we present results of benchmarking over a period of 7 years. Testing was always performed on the same two oncological domains. Altogether, around 20 systems were benchmarked. Our system GINESYS was constructed on the basis of the first benchmarking of around 10 systems in 1987, out of frustration that statistical systems regularly achieved better accuracy than single AI systems. GINESYS is described in Section 3, benchmarking in Sections 4 and 5.

2 Multiple Knowledge and Multistrategy Learning

Even the first expert systems like MYCIN (Shortliffe 1976), and most rule-based systems, already enabled a certain amount of multiplicity, i.e. redundancy, since rules can be more or less multiple. Newer systems like CN2 (Clark & Niblett 1989) or C4 (Quinlan 1987) contain a similar amount of redundancy, which is probably one of the reasons for their successful behaviour in noisy domains. In (Catlett & Jack 1987) it was reported that constructing a separate decision tree for each class with the same method as when constructing one decision tree for all classes significantly increased accuracy. A similar conclusion was derived by Clark & Boswell (1991) when constructing several lists of rules, and by Buntine (1989) when combining 5 decision trees with different roots.

In communications, the positive effect of using redundant bits has been known for decades, and even simple ID numbers in banking have additional digits in order to improve the robustness of the whole system. Theoretical aspects of redundancy in such cases are described e.g. in (Shannon & Weaver 1964).

In most every-day activities, people use multiple knowledge whenever there is any possibility of biasing (Utgoff 1986). For example, when hiring a new employee, one checks several reports which are basically multiple (e.g. biography, recommendations etc.). When making an important decision, humans often discuss possibilities in groups of relevant people. A council of physicians is consulted when dealing with difficult or important cases. One physician suffices for most normal activities, since one is substantially cheaper than a group.

It is commonly accepted that cross-checking several knowledge sources is generally better than using one source of knowledge alone. Humans are intrinsically multiple. They apply multiple strategies in every-day activities without paying much attention to that phenomenon. Therefore, machine and human multistrategy learning have a natural interrelationship and potential benefits in both directions.

In recent years there were several distinguished events related to multistrategy learning. Among them: a book edited by R. Michalski & G. Tecuci: Machine Learning, A Multistrategy Approach, Vol. IV, Morgan Kaufmann (1994), specialised international workshops on multistrategy learning organised by George Mason University, a special issue of Informatica (Tecuci 1993), and the IJCAI-93 workshop on integration of machine learning and knowledge acquisition (Tecuci, Kedar & Kodratoff 1993).

3 GINESYS

GINESYS (Generic INductive Expert SYstem Shell) is one of the oldest systems actively utilising multiple knowledge representations (Gams 1988). It consists of two systems (i.e. methods), one from AI and one from statistics. There were sensible reasons for combining methods from different fields. First of all, artificial intelligence methods enable the construction of knowledge bases which are typically very transparent and understandable; therefore, it was hoped that a combination would still be more understandable than statistical knowledge bases.
A statistical method was chosen on the basis of the hypothesis that the knowledge representations should be as different as possible. GINESYS utilises two different strategies on the basis of these two systems: the AI system constructs and consults lists of rules, and the statistical system multiplies probabilities according to the distribution of classes corresponding to each attribute of the tested example. Both single systems already implicitly utilise multiple knowledge - the AI part through a couple (typically 5) of rules attached to the main one, which are triggered when classifying, and the statistical part through combining probabilities relating to the value of each attribute of the tested example.

The AI part of GINESYS is named INESYS. It reimplements many of the mechanisms of the ID3 and AQ (CN2) family of algorithms. It was primarily designed as an attempt to fully simulate the family of ID3 and AQ inductive empirical learning systems (Gams & Lavrač 1987). Theoretically, it simulates N^M different algorithms, where M is the number of modules of the algorithm and N is the number of variations of each module (Gams 1989). The actual number of different variations of GINESYS can be estimated at several hundred.

INESYS constructs rules with a beam search over all possible combinations of attributes. In addition, it utilises several search-guiding and error-estimate functions such as informativity, the Gini index, the Laplacean error estimate and significance. Due to elaborate mechanisms for noise handling, INESYS typically constructs a small number of short rules, i.e. rules with a small number of attributes. For example, on average, 5.1 main rules with 1.4 attributes per rule were constructed in lymphography. In primary tumor, there were 11.0 main rules with 2.3 attributes per rule. Therefore, a typical rule had the form:

    if (A_i = V_ij) & (A_k = V_kl) then Distribution_n

where (A_i = V_ij) is a Boolean test of whether attribute i has value j, and Distribution_n is a class probability distribution corresponding to the condition part of the rule, i.e. a complex.

A general description of INESYS is:

    repeat
      construct Rule(s);
      add Rule(s) to d;
      L := L - ExamplesCoveredByRule
    until satisfiable d

where L is the set of learning examples, d is the constructed knowledge in the form of trees or lists of rules, and Rule(s) is one or many branches in a tree or one or many rules in the list of rules.

A procedural description of INESYS (the INESYS algorithm) follows, where d is the constructed knowledge in the form of an ordered list of ordered lists of rules, Star and BestRules are ordered lists of rules, L is the set of learning examples, and NP is a rule with an uninstantiated complex and the class probability distribution of L:

    d := ();                                  (* d is initialised *)
    repeat
      Star := (NP);
      BestRules := (NP);
      repeat
        for all Rule_i from Star generate all specialisations NewRule_i
          that do not fulfil the stopping criteria;
        Star := ();
        put into Star at most MAXSTAR of the best NewRule_i, evaluated by
          the user-defined impurity function;
        from the rules in BestRules and the significant NewRule_i choose
          the best MAXBEST rules, evaluated by the user-defined error
          estimate function, and put them into BestRules
      until Star is empty;
      add BestRules into d;
      L := L - examples covered by the best evaluated rule from BestRules
    until L is empty
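The covering loop and beam search above can be illustrated with the following heavily simplified Python sketch, using the example layout shown in the Introduction. The choice of the Gini index as impurity function and of the Laplacean error estimate, the parameter names maxstar, maxbest and min_cover, and the use of a minimum-coverage test in place of INESYS's stopping criteria and significance test are assumptions made for illustration only, not a reimplementation of INESYS.

    from collections import Counter

    def class_distribution(examples):
        """Relative class frequencies of a list of examples."""
        counts = Counter(e["class"] for e in examples)
        return {c: n / len(examples) for c, n in counts.items()}

    def covers(cpx, example):
        """A complex covers an example if every (attribute, value) test holds."""
        return all(example["attributes"].get(a) == v for a, v in cpx)

    def gini(examples):
        """Gini impurity of the class distribution (assumed impurity function)."""
        return 1.0 - sum(p * p for p in class_distribution(examples).values())

    def laplace_error(examples, n_classes):
        """Laplacean error estimate of a rule (assumed error estimate function)."""
        if not examples:
            return 1.0
        n = len(examples)
        n_maj = Counter(e["class"] for e in examples).most_common(1)[0][1]
        return (n - n_maj + n_classes - 1) / (n + n_classes)

    def induce_rules(learning_set, attributes, maxstar=5, maxbest=6, min_cover=2):
        """Simplified separate-and-conquer beam search in the spirit of INESYS."""
        n_classes = len({e["class"] for e in learning_set})
        knowledge, remaining = [], list(learning_set)
        while remaining:
            def covered(cpx):
                return [e for e in remaining if covers(cpx, e)]
            star, best = [frozenset()], [frozenset()]   # NP: uninstantiated complex
            while star:
                # Specialise every beam rule by one more (attribute, value) test;
                # rules covering fewer than min_cover examples are dropped
                # (a stand-in for the stopping criteria and significance test).
                candidates = {cpx | {(a, v)}
                              for cpx in star
                              for a in attributes if all(t[0] != a for t in cpx)
                              for v in {e["attributes"].get(a) for e in remaining}
                              if len(covered(cpx | {(a, v)})) >= min_cover}
                # Keep at most MAXSTAR candidates by the impurity function ...
                star = sorted(candidates, key=lambda c: gini(covered(c)))[:maxstar]
                # ... and at most MAXBEST rules overall by the error estimate.
                best = sorted(set(best) | set(star),
                              key=lambda c: laplace_error(covered(c), n_classes))[:maxbest]
            # The best rule becomes the main rule; the others act as confirmation rules.
            rules = [(cpx, class_distribution(covered(cpx))) for cpx in best]
            knowledge.append(rules)
            remaining = [e for e in remaining if not covers(best[0], e)]
        return knowledge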
In INESYS, the main improvement over existing rule-based systems are the rules attached to the main rule. The aim of these multiple rules is twofold. First, to give the user more rules and thus more opportunities to analyse the laws of the domain. Second, to improve classification accuracy by cross-checking the matched rule with confirmation rules. This mechanism already enables the use of multiple knowledge to a certain degree:

    if Complex1 then Class1
       (Complex11 then Class11
        Complex1R then Class1R)
    else if Complex2 then Class2
       (Complex21 then Class21
        Complex2R then Class2R)

Classification in INESYS starts by sequentially checking the main rules. When the first main rule matches a new example, the corresponding multiple rules that match the new example add their class probability distributions according to the formula for the union of independent events:

    p_12 = p_1 + (1 - p_1) * p_2.

Probabilities are multiplied by error estimates in order to calibrate the effect of rules with different credibility, and finally normalised. There are two threshold parameters that present a heuristic estimate of the goodness of classification by a rule: the smallest necessary percentage of the majority class (MINACC) and the smallest difference between the percentage of the majority class and the second to majority class (MINDIFF). Each constructed rule in GINESYS has to satisfy both conditions. Parameter MINDIFF additionally affects the classification process in the sense that the class probability distribution of the combined main and confirmation rules must satisfy it.

The second method in GINESYS is an approximation of the Bayesian classifier which assumes independence of attributes. It is often referred to as "naive Bayes" (Good 1950), in this paper also "Bayes". Naive Bayes constructs all possible rules with only one attribute in the complex. Therefore, the form of these rules is:

    if (A_i = V_ij) then Distribution_n.

The classification schema is as follows: all rules that match a new example are taken into consideration. The probability of each class c is computed by the following formula:

    P(c|A) = Pa(c) * (P(A_1|c) / Pa(A_1)) * ... * (P(A_v|c) / Pa(A_v))        (Eq. 1)

where P(c|A) denotes the probability of class c given the attributes and values A of the tested example, Pa(c) denotes the a priori probability of class c, P(A_i|c) the probability that attribute A_i has the same value as in the classified example given the class c, Pa(A_i) the same as before but regardless of class, and v is the number of attributes. By calculating the probabilities of all classes by (Eq. 1), a class probability distribution is obtained. Therefore, although naive Bayes constructs rules similar to those of INESYS, in the process of classification all attributes are considered in Bayes and on average only around two in INESYS.
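A minimal sketch of the classification step in (Eq. 1), stated over the example layout used earlier, is given below. The function name and the handling of values never seen in the learning set are assumptions, and the zero-frequency correction discussed in Section 5 is deliberately omitted here.

    from collections import Counter

    def naive_bayes_probabilities(example, learning_set):
        """Class probabilities following (Eq. 1), estimated by relative frequencies."""
        n = len(learning_set)
        class_counts = Counter(e["class"] for e in learning_set)
        scores = {}
        for c, n_c in class_counts.items():
            score = n_c / n                                   # Pa(c)
            for attr, value in example["attributes"].items():
                n_av = sum(1 for e in learning_set
                           if e["attributes"].get(attr) == value)
                n_avc = sum(1 for e in learning_set
                            if e["class"] == c and e["attributes"].get(attr) == value)
                # P(A_i|c) / Pa(A_i); a value never seen at all contributes factor 0.
                score *= (n_avc / n_c) / (n_av / n) if n_av else 0.0
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()} if total else scores

The predicted class is simply the one with the highest value in the returned distribution.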
Cooperation between the AI and the statistical system is relevant only when they propose different classes. In that case, the goodness of the triggered rules in INESYS is estimated by the simple heuristics mentioned above. If the goodness of the combined rules exceeds the value of a given threshold (parameter MINDIFF), the classification by INESYS is adopted. Otherwise, the classification by naive Bayes prevails. In other words: if the class probability distribution of the combined rules is estimated as unreliable, the statistical method is called in as a supervisor to decide which class is estimated as the most probable.

The combining schema is based on the following reasoning: when multiple rules confirm the main ones, classification is very likely to be correct. If a significant disagreement occurs, then the list of rules is not credible and the other method, using a different knowledge representation, should be consulted. It was expected that the short rules constructed by INESYS would be more successful when they have high confidence in their prediction, and that the approximation of the Bayesian classifier would be more successful when dealing with difficult cases, where truncated rules capturing the main and most important laws of the domain are not predicting with great certainty.

4 Benchmarking

Since 1987, systematic measurements have been performed on two oncological domains, lymphography and primary tumor. Data were obtained from real patients at the Oncological Institute Ljubljana (Kononenko 1985; Cestnik & Bratko 1988). Unknown values of attributes were replaced by the most common values regarding the class.

4.1 Domain Description

Basic statistics of the whole set of data are:

LYMPHOGRAPHY
  18 attributes
  2 - 8 (average 3.3) values per attribute
  9 classes
  150 examples
  distribution: 2 1 12 8 69 53 1 4 0
  all examples differ even if one attribute is deleted

PRIMARY TUMOR
  17 attributes
  2 - 3 (average 2.2) values per attribute
  22 classes
  339 examples
  distribution: 84 20 9 14 39 1 14 6 0 2 28 16 7 24 2 1 10 29 6 2 1 24
  75 examples in the data set have another example with the same values of
  attributes and a different class; if we delete one attribute, this number is:
  114 111 81 122 84 75 93 79 97 91 77 83 76 77 79 94 94

4.2 Benchmarked Systems

On the benchmark domains, around 20 AI and statistical systems were compared over more than half a decade. All the systems were given the same set of 10 random distributions of data, each time taking 70% of the data for learning and 30% of the data for testing. Results of relevant systems are presented in Table 1; the row in the middle of the table divides multiple and single systems, i.e. those that combine many rules during one classification and those that use only one rule.

                       LYMPHOGRAPHY                 PRIMARY TUMOR
  SYSTEM         class.acc. no.rules no.att.  class.acc. no.rules no.att.
  GINESYS*           70.5      5.1      7         52.2     11.0     25
  GINESYS            70.5      5.1      7         52.0     11.0     25
  BAYES              68.6     56.0     56         50.1     37.0     37
  CN2-new1           68.7                         50.3
  GB*                67.4      5.1      7         47.6     11.0     25
  CN2-new1'          65.6                         46.9
  NEAREST NEIG.      72.9                         40.4
  C4.5-rules         64.7                         38.2
  C4.5-trees-u       63.1                         48.9
  C4.5-trees-p       66.7                         48.8
  CN2-like1          67.3      4.8      8         48.7     11.4     27
  CN2-like1'         66.1      5.0      6         45.6     10.8     22
  ID3-like           61.8     25.0    110         48.7     28.6    129
  CN2-like2          66.8     10.8     21         45.7     19.3     70
  CN2-like2'         65.0      9.4     16         46.2     19.4     68
  AQ-like1           60.6      7.0     80         48.8     16.0    423
  AQ-like2           55.2      7.0     80         32.0     16.0    423

Table 1: Benchmarking systems on two oncological domains.
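The accuracies in Table 1 (and in the later tables) are averages over the splitting protocol described above. A minimal sketch of that protocol is shown below, assuming a hypothetical learner object with fit and classify methods rather than any of the actual benchmarked systems.

    import random

    def benchmark(learner, examples, n_runs=10, learn_fraction=0.7, seed=0):
        """Average accuracy over n_runs random learn/test splits (70%/30% by default)."""
        rng = random.Random(seed)
        accuracies = []
        for _ in range(n_runs):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            cut = int(learn_fraction * len(shuffled))
            learn, test = shuffled[:cut], shuffled[cut:]
            learner.fit(learn)                 # hypothetical learner interface
            correct = sum(learner.classify(e) == e["class"] for e in test)
            accuracies.append(correct / len(test))
        return sum(accuracies) / len(accuracies)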
GINESYS* is a version of GINESYS using "negation" multiple rules, which try to confront the main rule if possible. GB* is GINESYS* without the statistical method, i.e. INESYS with the functions G and B. The first nearest neighbour algorithm classifies with the class of the nearest neighbour, where distance is measured by the number of attributes with different values. BAYES is an approximation of the Bayesian classifier using the assumption that attributes are independent. ID3-like is a version of the ASSISTANT system using cross-validation pruning. CN2-like systems are different modifications of the CN2 algorithm, and CN2-new systems are the latest versions. C4.5-rules constructs rules, C4.5-trees-u unpruned trees, and C4.5-trees-p pruned trees. AQ-like systems are modifications of the AQ15 system.

Classification accuracy (column 1 in each domain in Table 1) was measured as the average percentage of correct classifications in ten test runs. The second column in each domain represents the average number of rules in a rule list or branches in the tree. The third column is the product of the number of rules (branches), the average length of a rule (branch) and the number of internal disjunctions.

The relations between the systems are similar to those observed in other measurements (Clark & Niblett 1989; Rendell et al. 1987; Rendell et al. 1988). Systems of the AQ family usually achieve lower classification accuracy than CN2 or ASSISTANT, while ASSISTANT and CN2 achieve similar classification accuracy. AQ-like1 represents an estimate of the upper possible classification accuracy of the rules constructed by the AQ-like system. BAYES achieved better results than the other systems except GINESYS. The nearest neighbour algorithm seems to be very domain dependent. GINESYS achieved the best average classification accuracy over both domains.

AQ-like systems construct more complex rules than the other systems. However, the third column might be misleading for tree-constructing algorithms like ID3-like because it represents a tree as a list of separated branches. GINESYS* and GINESYS are measured only by the main rules and not by the multiple ones. On the other side, from the results in Table 1 it follows that systems like GINESYS and CN2 construct a smaller number of shorter main rules, while AQ-like systems construct more complex rules.

The efficiency of the benchmarked algorithms was also analysed. AQ systems are about an order of magnitude slower than ASSISTANT, CN2 and GINESYS, and these are about an order of magnitude slower than BAYES. The results are similar to other measurements, keeping in mind that our versions of CN2 and GINESYS use a data compression mechanism which speeds up the algorithm roughly five times. GINESYS PC, another version of GINESYS, runs on IBM PC computers and is available as free scientific software.

4.3 Varying Impurity and Error Estimate Functions

In order to verify whether the improvements in GINESYS were caused by multiple knowledge or by domain-dependent parameters, several parameters were varied, and the functions were the first among them. GINESYS uses two different groups of functions: informativity functions and error estimate functions. Informativity functions strategically guide the search by trying to determine the amount of impurity. Error estimate functions try to estimate the classification error. Four functions were used in all 16 possible combinations in each domain. The classification accuracy of GINESYS with (GINESYS**) and without (INESYS**) top-level multiple knowledge was compared. In Table 2 we present only the best three combinations of INESYS** in both domains. The four functions used were: I - informativity (Quinlan 1986); A - % of majority class; G - Gini index (Breiman et al. 1984); B - Laplacean error estimate (Niblett & Bratko 1986). The first letter denotes the impurity function and the second letter the error estimate function.

           LYMPHOGRAPHY                       PRIMARY TUMOR
  FUNCTIONS  INESYS**  GINESYS**      FUNCTIONS  INESYS**  GINESYS**
     AB        68.4      69.7            BA        48.3      52.3
     GB        67.4      69.9            BG        48.1      51.8
     BB        66.4      70.8            GB        47.6      52.0
     BG        62.6      68.4            AB        46.6      51.3
     BA        62.4      68.4            BB        46.4      52.5

Table 2: Accuracy under different impurity and error estimate functions.
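For reference, common textbook formulations of the four functions are sketched below, stated over a class probability distribution (and, for the Laplacean estimate, over rule coverage counts). They illustrate the kinds of functions that were varied and are only assumed to approximate the exact formulations used in GINESYS.

    import math

    def informativity(distribution):
        """I - informativity: entropy of the class distribution (Quinlan 1986)."""
        return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

    def majority(distribution):
        """A - percentage of the majority class."""
        return max(distribution.values())

    def gini_index(distribution):
        """G - Gini index of the class distribution (Breiman et al. 1984)."""
        return 1.0 - sum(p * p for p in distribution.values())

    def laplace_error(n_covered, n_majority, n_classes):
        """B - Laplacean error estimate of a rule (in the style of Niblett & Bratko 1986)."""
        return (n_covered - n_majority + n_classes - 1) / (n_covered + n_classes)

    # A two-letter code such as "GB" names a combination: the first letter is the
    # impurity function guiding the beam search, the second the error estimate
    # used to rank candidate rules.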
The measurements presented in Table 2 indicate that the Laplacean error estimate is one of the most successful functions used for impurity or error estimates. Informativity is unexpectedly not present in the best three combinations. The default functions for the GINESYS systems (GB) were taken in advance from the literature (Breiman et al. 1984; Niblett & Bratko 1986).

4.4 Varying Percentage of Learning Examples

The benchmarks in Sections 4.2 and 4.3 were performed on 10 distributions of data, each time taking 70% of the data for learning and 30% of the data for testing. In Tables 3 and 4 we varied the percentage of learning data from 20% to 80% and used the remaining data for testing. A graphical representation of the data in Table 4 is shown in Figure 1. Systems in Figure 1 are denoted as in column 1 of Table 4. ASSIST 0 is ASSISTANT without pruning and INESYS is GINESYS without the statistical method.

  LYMPHOGRAPHY                % OF LEARNING EXAMPLES
  SYSTEM          20%    30%    40%    50%    60%    70%    80%
  GINESYS        52.8   58.2   63.1   63.7   60.1   70.5   75.3
  BAYES          52.8   59.3   60.8   61.2   58.2   68.6   72.1
  INESYS         39.2   51.7   54.1   62.6   59.0   67.4   74.3
  ASSISTANT      53.9   60.5   57.9   57.5   55.2   62.1   65.2
  ASSIST 0       53.2   60.7   57.4   57.8   55.9   62.4   66.8

Table 3: Accuracy in lymphography at different percentages of learning data.

  P. TUMOR                    % OF LEARNING EXAMPLES
  SYSTEM          20%    30%    40%    50%    60%    70%    80%
  GINESYS (G)    41.9   44.6   48.1   49.0   48.1   52.0   52.3
  BAYES (B)      41.8   45.2   47.5   48.0   47.2   50.1   50.3
  INESYS (I)     26.9   35.6   33.8   43.5   41.2   45.9   46.7
  ASSISTANT (A)  39.8   43.5   43.5   45.9   44.3   47.9   49.2
  ASSIST 0 (A0)  39.6   41.6   39.9   41.1   39.6   41.3   41.7

Table 4: As in Table 3, but for the primary tumor domain.

Figure 1: Graphical representation of the data in Table 4. On the x-axis is the percentage of learning data and on the y-axis is classification accuracy.

Probably the main reason for the disproportionately low classification accuracy of INESYS with a small number of learning examples are the functions, which work well only with several tens of examples. But even then there are some cases when INESYS classifies better than BAYES. The combining mechanism usually decides well when to choose the right method. The performance of INESYS increases with the number of learning examples, and the gain of GINESYS over BAYES also proportionally increases. In lymphography, ASSISTANT prunes the tree by approximately 50% and achieves very similar classification accuracy to ASSIST 0. In primary tumor, the pruned tree constructed by ASSISTANT is roughly 4 times smaller than the tree of ASSIST 0, which besides constructing more complex trees also achieves lower classification accuracy.

The improvement of GINESYS over the better of its two subparts was typically around 1-2%, leading to the conclusion that the combining mechanism performed well when changing the number of learning examples.

4.5 Varying Additional Noise

Noise was introduced into the lymphography and primary tumor domains, to attributes and classes in the learning and test examples. For example, 1% of noise means that, on average, each hundredth attribute value and each hundredth class label was randomly changed in learning and test data.
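A sketch of this noise injection procedure is given below. The parameters attribute_values (a mapping from each attribute to its list of legal values) and classes (the list of class labels) are hypothetical helpers, and details such as whether a replacement may coincide with the original value are assumptions, since the experiments do not specify them beyond the description above.

    import random

    def add_noise(examples, level, attribute_values, classes, seed=0):
        """Return a copy of the examples with `level` (e.g. 0.05 for 5%) noise added.

        Each attribute value and each class label is, with probability `level`,
        replaced by a random legal value; replacement by the same value is allowed.
        """
        rng = random.Random(seed)
        noisy = []
        for e in examples:
            attrs = dict(e["attributes"])
            for a in attrs:
                if rng.random() < level:
                    attrs[a] = rng.choice(attribute_values[a])
            cls = rng.choice(classes) if rng.random() < level else e["class"]
            noisy.append({"attributes": attrs, "class": cls})
        return noisy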
Average results of 10 tests (see Section 4.2) are presented in Tables 5 and 6, and in Figure 2.

  LYMPHOGRAPHY                % OF ADDITIONAL NOISE
  SYSTEM           0%     1%     5%    10%    20%    35%    50%
  GINESYS        70.5   65.3   63.7   53.1   43.8   28.9   21.1
  BAYES          68.6   65.8   61.7   51.1   41.8   28.0   20.7
  INESYS         67.4   63.4   59.1   53.0   41.4   30.3   25.4
  ASSISTANT      62.1   60.2   52.8   34.1   33.3   23.4   18.4
  ASSIST 0       62.4   60.5   51.8   41.6   29.9   23.5   17.6

Table 5: The influence of additional noise - lymphography.

  P. TUMOR                    % OF ADDITIONAL NOISE
  SYSTEM           0%     1%     5%    10%    20%    35%
  GINESYS (G)    52.0   50.6   42.6   35.2   23.5   13.8
  BAYES (B)      50.1   47.8   40.3   33.5   23.6   13.9
  INESYS (I)     45.9   43.5   36.2   30.7   20.0   16.1
  ASSISTANT (A)  47.9   44.9   39.4   30.5   16.7    8.4
  ASSIST 0 (A0)  41.3   39.1   32.4   25.3   14.5    8.7

Table 6: The influence of additional noise - primary tumor.

Figure 2: Graphical representation of the data in Table 6. On the x-axis is the percentage of additional noise and on the y-axis is classification accuracy.

When the amount of noise increases, the performance of INESYS relatively improves and achieves even better classification accuracy than GINESYS. As expected, in a very noisy situation a small number of short rules performs best. A similar effect is noticeable in the lymphography domain, especially compared to ASSISTANT, and is probably connected to the fact that ASSISTANT constructs a tree with several tens of leaves while INESYS constructs from 2 to 5 rules. With a growing amount of noise, the gain of GINESYS slowly decreases but remains around 2% as long as any rule of INESYS can be trusted as a meaningful one.

5 New Measurements

In further attempts to verify the results presented in Section 4, GINESYS and the benchmark data were, around five years ago, sent to over 50 laboratories and declared to be freely available for scientific purposes. The obtained answers can be clustered into two groups: several laboratories benchmarked systems on the proposed two domains, or at least approved of the approach. On the other hand, there were some researchers who considered the proposed benchmarking of classification accuracy a numerical measurement belonging to statistics. In their opinion, artificial intelligence methods should be evaluated mainly at the level of ideas. Indeed, measuring only classification accuracy does not consider several important advantages of artificial intelligence, e.g. the transparency of the constructed knowledge base or the comprehensibility of classifications. However, in the last two years we have observed a constant shift in a direction which accepts such verifications as crucial in evaluating quality.

In 1990 we received the first, and so far only, report of a system, NAIVE BAYES* (Cestnik 1992), which achieved better accuracy than GINESYS in both domains (54.1% in primary tumor and 70.9% in lymphography). The improvement is based on a correction of a weakness of NAIVE BAYES which appears whenever there is a gap in the data, meaning that there is no learning example with the particular value of the attribute. Then one factor in the product becomes 0 and the resulting product (Eq. 1) becomes 0. This was already observed in (Gams & Drobnič 1988; Gams et al. 1991), where ε was used instead of 0. In NAIVE BAYES*, the Laplacean estimate is introduced as a correction instead of ε.
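The gap problem and its correction can be illustrated as follows. The Laplacean smoothing of the conditional probabilities shown here is one common formulation and is only assumed to approximate the exact correction used in NAIVE BAYES* (Cestnik 1992); the parameter n_values (a mapping from each attribute to its number of possible values) is a hypothetical helper.

    from collections import Counter

    def laplace_conditional(attr, value, cls, learning_set, n_values):
        """P(A_i = value | class) with a Laplacean correction: (n_ivc + 1) / (n_c + k_i),

        where n_ivc counts learning examples of the class with this attribute value,
        n_c counts examples of the class, and k_i is the number of values of the
        attribute.  The estimate is never exactly zero, so gaps in the data no
        longer force the whole product in (Eq. 1) to zero.
        """
        n_c = sum(1 for e in learning_set if e["class"] == cls)
        n_ivc = sum(1 for e in learning_set
                    if e["class"] == cls and e["attributes"].get(attr) == value)
        return (n_ivc + 1) / (n_c + n_values[attr])

    def corrected_bayes_scores(example, learning_set, n_values):
        """Class scores proportional to Pa(c) * prod_i P(A_i|c).

        The class-independent 1/Pa(A_i) factors of (Eq. 1) cancel in the final
        normalisation and are therefore omitted.
        """
        n = len(learning_set)
        scores = {}
        for cls, n_c in Counter(e["class"] for e in learning_set).items():
            score = n_c / n
            for attr, value in example["attributes"].items():
                score *= laplace_conditional(attr, value, cls, learning_set, n_values)
            scores[cls] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}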
The reported improvements enabled additional experiments in trying to construct a multiple system achieving even better classification accuracy. In the first attempt, NAIVE BAYES* was directly embedded into GINESYS, but the observed classification accuracy was lower than that of NAIVE BAYES*. Obviously, a smaller number of stronger rules had to be constructed, since NAIVE BAYES* achieved significantly better classification accuracy than GB. Several parameters in GINESYS deal with rules, such as significance (Kalbfleisch 1979), the modified Laplacean error estimate (Niblett & Bratko 1986) or MINDIFF and MINACC. In the second attempt, MINDIFF was set to 0.5 instead of the previous 0.3, and MINACC to 0.7. Consequently, GINESYS90 achieved an additional 0.8% increase in primary tumor and 1.3% in lymphography over NAIVE BAYES*. Later it was found that these values of MINACC and MINDIFF belong to the set of optimal combinations, as can be observed in Tables 8 and 9. The updated versions of NAIVE BAYES and GINESYS achieve the best two classification accuracies (compare Table 1 and Table 7). The percentage of corrections by NAIVE BAYES was 8% in lymphography and 27% in primary tumor in GINESYS and, correspondingly, 25% and 45% in GINESYS90.

The new values of the parameters MINDIFF and MINACC force GINESYS90 to construct a smaller number of longer rules. Also, rules are usually corrected by NAIVE BAYES* roughly twice as often as in GINESYS. To a great extent, this is due to the increased average number of classifications performed by the null or uninstantiated rule, i.e. the last rule in a rule list. This number increased from 9.2 to 15.9 in lymphography (45 classifications), and from 18.0 to 55.1 in primary tumor (102 classifications). Understandably, the last uninstantiated rule is always considered unreliable in GINESYS and GINESYS90. But in the INESYS and INESYS90 algorithm, the classification is still performed by the corresponding null-rule class distribution, which is typically only slightly better than the default rule. Therefore, it is understandable that on average accuracy of

                      LYMPHOGRAPHY                 PRIMARY TUMOR
  SYSTEM        class.acc. no.rules no.att.  class.acc. no.rules no.att.
  INESYS90          63.7      3.8      7         36.3      6.9     19
  NAIVE BAYES*      70.9     56.0     56         54.1     37.0     37
  GINESYS90         72.2     59.6     63         54.9     44.3     56

Table 7: Accuracy, number of rules and number of all attributes.

LYMPHOGRAPHY ACC. .1 .2 .3 .4 .5 .6 .7 .8 .9 .9 71.0 + .8 72.0