Implementation of a Slovene Language-Based Free-Text Retrieval System

A study submitted in fulfilment of the requirements for the degree of Doctor of Philosophy at The University of Sheffield

by Mirko Popovič

Department of Information Studies

June 1991

Abstract

This thesis is concerned with providing end-user access to bibliographic databases in Slovenia. A statistically-based approach to document retrieval, in particular nearest neighbour searching, is selected as a means to achieve this goal. The following two main questions are investigated in the context of this thesis: (a) are statistically-based techniques applicable to Slovene information retrieval systems, and (b) could statistically-based techniques provide a framework for developing a multi-lingual information retrieval system?

After providing a theoretical background to the experimental work, the design of a stop-word list and a stemming algorithm for Slovene is discussed. The resulting stop-word list contains a total of 1,593 non-content-bearing words. Two stemming algorithms are described, one context-free and the other context-sensitive; the latter is found to be far more effective in operation, owing to the large number of context-sensitive and recoding rules that are required to reflect fully the morphology of Slovene. The retrieval effectiveness of this stemming algorithm is evaluated within the best-match context, using the Slovene version of the INSTRUCT package. The performance of the stemming algorithm is tested by comparison with two other types of text representation, i.e., manual right-hand truncation carried out by a trained intermediary, and unstemmed words.
The results of this comparative evaluation reveal the following: (a) there is a significant performance difference between automatic word conflation and unstemmed processing of the Slovene text; (b) there is no significant performance difference between automatic stemming and manual right-hand truncation, carried out by a trained intermediary. It follows that one of the important components of an information retrieval system, i.e., word conflation, can be automated in Slovene systems with no average loss of performance, thus allowing users easier access to the systems.

Having obtained good performance results with the employment of the Slovene stemming algorithm, a multi-lingual experiment is described. Its main objective is to test the performance of statistically-based techniques in two different languages, i.e., Slovene and English. A detailed analysis of performance results confirms only one of three main hypotheses. Although the experiment on the identification of stem variants produces a similar number of related terms from both English and Slovene dictionary components of the inverted file, the other two hypotheses are rejected, i.e., (a) processing of the English documents and queries does not produce more or less identical hits to those retrieved from the Slovene database; (b) the Slovene version of INSTRUCT produces significantly better performance results than its English equivalent. The employment of a failure analysis reveals two main causes of performance difference, i.e., the frequent occurrence of synonyms and other related terms, and the automatic word conflation carried out by the two different stemming algorithms.

On this basis, conclusions and suggestions for further work are given. It is emphasized that advanced, statistically-based techniques of information retrieval will be firmly established in Slovenia only if they can be enhanced with refinements which allow a multi-lingual approach to document retrieval.
Acknowledgements

I would like to give my sincere thanks to all those who helped me during the course of this PhD project: in particular, Dr Peter Willett, my supervisor, for his encouragement, guidance, and professionalism.

Unfortunately, the participants who gave so much of their valuable time to evaluation are too numerous to mention, but this project would not have been possible without them. However, I would like to emphasize the work carried out by Boris Košorok, a professional intermediary at the National & University Library in Ljubljana; without his help, it would have been impossible to complete all the experimental tests.

Sincere thanks are offered also to Richard Gilbert, of the Computing Centre at the University of Sheffield, who gave me valuable information on how to use international e-mail facilities. His knowledge was of particular importance in the second phase of my research, while I was working in Ljubljana, and frequent communication with Sheffield became a rule.

Since English is not my native language, I would like to thank my colleagues Janey Cringean, Val Gillett, and Helen Grindley for their time and effort in clarifying some of the expressions in my thesis. In particular, I would like to show gratitude to Helen Grindley for taking the time to read the final draft of my thesis and for providing useful comments.

Thanks are owed also to the British Council, the Ministry of Culture in Slovenia, the Ministry of Research and Technology in Slovenia, and the National & University Library in Ljubljana, for their financial support.

Finally, I would like to thank my wife, Breda, for her encouragement, time and patience. This PhD thesis is Breda's as much as it is mine.

Contents

1 Recent Trends In Document Retrieval
   1.1 Introduction
      1.1.1 Document retrieval - a definition
      1.1.2 Some characteristics of current document retrieval systems
   1.2 Automatic indexing
      1.2.1 Comparison between manual and automatic indexing
      1.2.2 Statistical approach to automatic term selection
      1.2.3 Word conflation
   1.3 Best-match searching
      1.3.1 Comparison of conventional (Boolean) and best-match retrieval
      1.3.2 Implementation of best-match searching
   1.4 Weighting of search terms
   1.5 Conclusions
2 Automatic Word Conflation
   2.1 Introduction
   2.2 Characteristics of stemming algorithms
      2.2.1 Types of stemming algorithms
      2.2.2 Compilation of a suffix list
      2.2.3 Mode of operation of stemming algorithms
      2.2.4 Conditional rules
      2.2.5 Recoding rules
      2.2.6 Users' needs
      2.2.7 Language dependency of a stemming algorithm
      2.2.8 Some other characteristics of stemming algorithms
   2.3 Conflation algorithms: a review
      2.3.1 Lovins
      2.3.2 Dawson
      2.3.3 RADCOL
      2.3.4 INSPEC
      2.3.5 Automatic generation of suffix lists
      2.3.6 Hafer and Weiss
      2.3.7 SMART
      2.3.8 MORPHS
      2.3.9 Cercone
      2.3.10 MARS
      2.3.11 Porter
      2.3.12 OKAPI
      2.3.13 CITE
   2.4 Evaluation of conflation algorithms for information retrieval
      2.4.1 Automatic stemming vs. full word retrieval
      2.4.2 Automatic stemming vs. right-hand truncation
      2.4.3 Evaluation of different conflation algorithms
   2.5 Conclusions
3 Main Characteristics of the Slovene Language
   3.1 Introduction
   3.2 The Slovene alphabet and pronunciation
      3.2.1 Vowels
      3.2.2 Consonants
   3.3 Morphological structure of the Slovene language
      3.3.1 The concept of word formation
      3.3.2 Inflectional morphology of Slovene
      3.3.3 The category of gender
      3.3.4 The category of number
      3.3.5 The category of case
      3.3.6 The category of degree
      3.3.7 Grammatical categories of the verbal forms
   3.4 Types of morphemic alternations
      3.4.1 Vocalic alternations
      3.4.2 Consonantal alternations
      3.4.3 Truncation
      3.4.4 Complexity of Slovene morphology, using an example
   3.5 Conclusions
4 Development of a Stemming Algorithm for the Slovene Language
   4.1 Introduction
      4.1.1 Information retrieval research in Slovenia
      4.1.2 Computer analysis of the Slovene language in medicine
      4.1.3 A general framework for the design of a stemming algorithm for the Slovene language
   4.2 A methodological framework of the experimental work
   4.3 Development of a stop-word list
      4.3.1 Frequency distribution of terms
      4.3.2 Design of the Slovene stop-word list
      4.3.3 Evaluation of the new stop-word list
   4.4 Design of a stemming algorithm
      4.4.1 Development of a suffix list
      4.4.2 Design of the frequency algorithm
   4.5 Design of the new stemming algorithm for the Slovene language
      4.5.1 Development of the suffix list
      4.5.2 The new stemming algorithm for the Slovene language
5 INSTRUCT: an INteractive System for Teaching Retrieval Using Computational Techniques
   5.1 Introduction
   5.2 Original version of INSTRUCT - program facilities
      5.2.1 The user interface
      5.2.2 Query formulation
      5.2.3 Searching
   5.3 Enhancements to INSTRUCT
      5.3.1 The user interface
      5.3.2 Query expansion on the basis of term co-occurrence
      5.3.3 Cluster-based searching
      5.3.4 Post-search options
   5.4 Main modules of the INSTRUCT package
   5.5 The use of INSTRUCT at the University of Sheffield
      5.5.1 Use in teaching programmes
      5.5.2 Use in research programmes
   5.6 Processing of documents and queries in a Slovene language free-text retrieval system
      5.6.1 The Slovene version of INSTRUCT
   5.7 An example of best-match searching using the Slovene version of INSTRUCT
   5.8 Conclusions
6 Evaluation of the Stemming Algorithm for Slovene IR - Experimental Environment
   6.1 Introduction
   6.2 Laboratory versus operational tests
   6.3 Test collection
      6.3.1 Documents
      6.3.2 A set of queries
      6.3.3 Relevance assessments
   6.4 Text representation modules in INSTRUCT
      6.4.1 Automatic stemming
      6.4.2 Non-conflation
      6.4.3 Manual right-hand truncation
   6.5 Methods for the analysis of data
   6.6 Conclusions
7 Evaluation of the Stemming Algorithm for Slovene IR - Analysis of Results
   7.1 Introduction
   7.2 Collection of data
      7.2.1 Searching
      7.2.2 A pool of retrieved documents
   7.3 Analysis of results
      7.3.1 Recall and precision as measures of retrieval effectiveness
      7.3.2 Significance tests
      7.3.3 Additional comparison of automatic stemming and manual right-hand truncation
   7.4 Conclusions
8 Multi-Lingual Approach to Document Retrieval
   8.1 Introduction
   8.2 Purpose of the experiment
   8.3 Background for the experiment
      8.3.1 Statistically-based techniques in multi-lingual IR systems
   8.4 Methodology
      8.4.1 The test environment
      8.4.2 The test procedures
   8.5 Analysis of results
      8.5.1 Multi-lingual experiment, using best-match searching facility
      8.5.2 A multi-lingual experiment based on the identification of word variants
   8.6 Conclusions
9 Conclusions
   9.1 Introduction
   9.2 Summary of results and conclusions
      9.2.1 Development of a stop-word list and a stemming algorithm
      9.2.2 Retrieval effectiveness of the stemming algorithm
      9.2.3 Multi-lingual approach to document retrieval
   9.3 Suggestions for further work
A A list of consulted literature
B The list of natural language queries
C The list of queries as processed by the trained intermediary
D The list of English language queries

Preface

The provision of end-user searching facilities has been recognized as the only way to remove a barrier between the original source of a query and the query's answer. This thesis is therefore aimed at increasing the possibilities of easy end-user access to bibliographic databases in Slovenia. At present, end-users in Slovenia are faced not only with a growing number of bibliographic and other types of databases, but also with a multi-lingual information retrieval environment. In other words, they are surrounded by document collections written in many different languages (i.e., Slovene and other Yugoslav languages, major European languages). In addition, all software systems (e.g., ATLASS, TRIP) available for accessing these databases are typical of current retrieval software elsewhere in that they are based on Boolean searching, with professional intermediaries being used to carry out on-line searches on behalf of end-users. Consequently, modern, non-conventional methods and techniques of information retrieval which allow direct, end-user interaction with the system are neither incorporated into existing retrieval systems in Slovenia, nor has much research been carried out in this area.

One of the main research areas in information retrieval is the development of algorithmic procedures which allow the computer to undertake many of the functions of a trained intermediary. This approach, based on the use of a range of statistical techniques—also known as the statistically-based approach to document retrieval—has been used in this thesis to develop a Slovene language-based free-text retrieval system.
Therefore, the main problem which was investigated in the context of this PhD project is contained in the following two questions:

1. Are statistically-based techniques applicable to Slovene information retrieval systems?

2. Could statistically-based techniques provide a framework for developing multi-lingual information retrieval systems?

The thesis begins with an introduction and description of recent trends in document retrieval (Chapter 1). Particular attention is given to the statistically-based approach to information retrieval, which is based on the following main components: automatic indexing, nearest neighbour searching, and term weighting. Most of the statistically-based techniques are independent of a particular language, with one exception, i.e., processing by a stemming algorithm. Chapter 2 therefore contains a detailed review of the automatic word conflation techniques which can be used to increase the effectiveness and efficiency of information retrieval systems. The development of a Slovene language-based free-text retrieval system required the design of an effective stemming algorithm for the Slovene text. Since such a design must take into account the language's morphological structure, Chapter 3 presents the main characteristics of contemporary Slovene, with particular reference to its inflectional morphology. The design of a stop-word list and a context-sensitive stemming algorithm for Slovene—together with their evaluation—are covered in Chapter 4. However, the retrieval effectiveness of these two language-dependent procedures cannot be evaluated without their incorporation into a retrieval system. The INSTRUCT package was used as a test bed for this experiment and is thus outlined in Chapter 5, together with a description of a Slovene version of INSTRUCT which required the implementation of language-dependent procedures.
Chapter 6 discusses a test environment (a test collection, text representation modules) which was built to evaluate the effectiveness of the Slovene stemming algorithm. The performance of this algorithm was compared with two other types of text representation, i.e., manual right-hand truncation and non-conflation. Chapter 7 deals with results obtained from this experiment. Chapter 8 compares the performance of the Slovene information retrieval system with its English equivalent in order to find out whether statistically-based techniques—as implemented in INSTRUCT—could provide a framework for a multi-lingual approach to document retrieval. Finally, Chapter 9 presents the conclusions and suggests areas for future work. Appendices A-D contain a list of literature which was consulted to describe the main characteristics of the Slovene language, and the three sets of queries that were used. A floppy disk contains the lists of stop-words and suffixes that were identified during the project.

Chapter 1 Recent Trends In Document Retrieval

1.1 Introduction

1.1.1 Document retrieval - a definition

The term document retrieval can best be described within the general framework of information retrieval, which is concerned with the representation, storage, organization, and accessing of information items (Salton and McGill, 1983). The term information retrieval covers a wide range of disciplines with the emphasis on non-numeric computing (e.g., document retrieval, natural language processing, database management systems). Historically, however, the term information retrieval (IR) has been used to cover the storage, processing and retrieval of information from databases containing bibliographical details on documents of all kinds (e.g., books, journal articles, reports, patents). In these databases, the input information consists of natural language text (e.g., bibliographic citations, abstracts), and the output to a search request comprises sets of references.
These references are intended to provide end-users with information about items of potential interest. Thus, information retrieval systems can alternatively be described as document retrieval systems, or even more precisely, as reference retrieval systems (Willett, 1988a). The main task of these systems is to (Salton and McGill, 1983):

• retrieve documents and references;

• store natural language texts; and

• process users' queries.

However, the rapid spread of full-text databases (Tenopir, 1984) as a result of changing technology (e.g., the trend towards electronic publishing, use of optical readers to scan text) also leads to the application of document retrieval techniques to textual materials themselves.

1.1.2 Some characteristics of current document retrieval systems

In today's changing society, in which information and knowledge play a crucial role in socioeconomic development, there is increasing interest in the use of document retrieval systems. The current trends can be summarized in Goldsmith's words (1982): "... computerized text retrieval has been a concept and a reality for some years but has not really been a workable proposition until recently, probably because of lack of sufficient computer power and because of a general lack of awareness, and hence commitment, by those most likely to benefit" (p. 41).

Computerized information retrieval systems have now been in use for almost thirty years, starting with the pioneering work in KWIC indexing, followed by the development of SDI systems, current awareness and retrospective searching. However, it is technological progress, i.e., increases in computer processing power and capacity, improvements in telecommunications, reductions in the cost of direct access storage devices—together with increasing user demand for accurate and up-to-date information—which has led to a growing number of both the large on-line, external databases, and internal, so-called in-house databases.
In addition to a growing number of on-line bibliographic databases (Williams, 1985), there is increasing interest in other types of data, and thus different kinds of databases (e.g., numerical databanks, full-text databases). To make efficient and effective use of textual data in both external and internal databases, sophisticated methodologies and techniques are needed to store, process, and transmit information. Surprisingly, despite great technological developments, it is often claimed that current information retrieval systems are unable to deal effectively with the ever increasing growth of information and that current operational capabilities remain at a relatively elementary stage (Salton and McGill, 1983). Specifically, there are two essential features in existing document retrieval systems:

• inverted file organization;

• use of Boolean operators.

In addition to the source data file, inverted file organization consists of the following two main files: the dictionary file and the postings file. In the dictionary file, indexing terms are listed in alphabetical order along with the total number of documents in which they occur. The postings file lists the indexing terms with the accession number of each of the documents in which a term appears. This file is used to "point" to the documents in the main sequential data file so that the full record can be retrieved. Taken together, the dictionary and postings files are generally referred to as an inverted file. Although inverted file organization requires additional storage and maintenance, its crucial advantage over serial file organization is its very rapid response to queries. Instead of inspecting each document in a database, only a few documents need to be matched against a query because the data file is inverted to provide indexes to documents containing the query terms.
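The relationship between the dictionary and postings files can be sketched with a small in-memory model. The following Python sketch is purely illustrative (the document records and function name are invented for this example, and a real system would hold the files on disk, not in memory):

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Build a toy inverted file from {accession number: text} records.

    The 'dictionary file' maps each indexing term to the total number of
    documents in which it occurs; the 'postings file' maps each term to
    the sorted accession numbers of those documents.
    """
    term_docs = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            term_docs[term].add(doc_id)
    dictionary = {term: len(ids) for term, ids in term_docs.items()}
    postings = {term: sorted(ids) for term, ids in term_docs.items()}
    return dictionary, postings

# Invented example records (accession number -> title text)
docs = {
    1: "retrieval of bibliographic records",
    2: "boolean retrieval systems",
    3: "stemming algorithms for slovene text",
}
dictionary, postings = build_inverted_file(docs)
print(dictionary["retrieval"])  # document frequency: 2
print(postings["retrieval"])    # accession numbers: [1, 2]
```

A query term is answered by a single dictionary lookup rather than a scan of every document, which is the rapid-response advantage described above.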
Such a file organization, using individual terms and document reference numbers, can handle Boolean operators particularly easily by translating AND, OR, and NOT operations into set intersection, set union, and set difference, respectively. Consequently, Boolean searching techniques—with some additional facilities, e.g., truncation and proximity searching—dominate in the present document retrieval systems. Why, then, is it so often reported in the literature that these systems are not capable of entirely meeting users' needs, and, moreover, that they are hostile to end-users (Cleverdon, 1984)?

It is stressed by Pollitt (1986) that, "... computerised searching services will not have their full impact upon user communities until direct user searching is widespread" (p. 1). One of the consequences of the increasing number of large databases is the appearance of so-called professional intermediaries who are required to carry out on-line searches on behalf of end-users. These people are needed to help formulate user requests and to provide guidance on how the system is organized, on what materials are available, and how to search for and locate the desired items. This is a very questionable situation because it is very difficult to determine what the user really requires and the intermediary may seldom be aware of a user's real needs. It is widely believed that end-users will be able to make their own requests on-line when search processes are simplified or made more "friendly".

A need for professional intermediaries is recognized as a limitation of current information retrieval systems since either end-users are excluded from the search process or extensive training is required to carry out searches in a cost-effective manner. It is claimed by Cleverdon (1984) that, "... the error rather lies in a failure to exploit correctly the resources of the computer, due to a lack of understanding of the nature of retrieval systems" (p. 38).
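The translation of AND, OR, and NOT into set operations can be shown directly. A minimal Python sketch, using invented posting lists (the accession numbers are illustrative only):

```python
# Posting lists (invented): one set of accession numbers per indexing term
posting_lists = {
    "retrieval": {1, 2, 5, 8},
    "boolean":   {2, 3, 8},
    "stemming":  {4, 5},
}

# retrieval AND boolean -> set intersection
print(sorted(posting_lists["retrieval"] & posting_lists["boolean"]))  # [2, 8]

# retrieval OR stemming -> set union
print(sorted(posting_lists["retrieval"] | posting_lists["stemming"]))  # [1, 2, 4, 5, 8]

# retrieval NOT boolean -> set difference
print(sorted(posting_lists["retrieval"] - posting_lists["boolean"]))  # [1, 5]
```

Each operator works on posting lists alone, so a Boolean query never touches the main data file until the final result set is displayed.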
At present there are two main research areas in information retrieval whose main aim is to enable end-users to carry out searching in both an efficient and effective manner. One of these areas is research into expert intermediary systems to provide intelligent front-ends to bibliographic databases. This approach, also known as a knowledge-based approach to information retrieval systems, uses expert systems techniques (e.g., rule-based programming) to encode the expertise possessed by a trained intermediary. This research has resulted in some operational systems, for example CONIT (Marcus, 1983) and CANSEARCH (Pollitt, 1986), which enable end-users to undertake searches without the knowledge or training of a professional intermediary. The main advantage of these systems is communication with the database via the user's natural language; the main drawback is their domain dependence.

A quite different area of research in enabling on-line searching to be carried out by end-users is based on the development of algorithmic procedures which allow the computer to undertake many of the functions of a trained intermediary. This approach, based on the use of a range of statistical techniques, is also known as the statistically-based approach to information retrieval. Research in this area, which is also concerned with retrieval effectiveness, i.e., the retrieval of a larger amount of relevant material than Boolean systems, has resulted in some operational systems, e.g., SMART, SIRE (Salton and McGill, 1983), MASQUERADE (Brzozowski, 1983), CUPID (Porter, 1982), CITE (Doszkocs, 1983), INSTRUCT (Hendry et al., 1986a), and MUSCAT (Porter and Galpin, 1988). These, and some other systems—although in their early stages—have clearly demonstrated many advantages over conventional, Boolean text retrieval systems. At the heart of this new generation of free-text retrieval systems are the following techniques:

• automatic indexing;

• best-match retrieval;

• term weighting.
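The contrast with Boolean searching can be illustrated by a minimal best-match ranking: each document is scored by how many query terms it contains, and the highest-scoring documents are returned first, rather than an all-or-nothing Boolean match set. This is only a schematic sketch under invented data, not the INSTRUCT implementation (which adds term weighting, as described later in this chapter):

```python
def best_match(query_terms, collection, top_n=5):
    """Score each document by the number of query terms it contains and
    return the highest-scoring documents first as (score, doc_id) pairs."""
    scores = []
    for doc_id, text in collection.items():
        doc_terms = set(text.lower().split())
        score = len(query_terms & doc_terms)
        if score > 0:
            scores.append((score, doc_id))
    scores.sort(reverse=True)
    return scores[:top_n]

# Invented example collection (accession number -> title text)
collection = {
    1: "boolean retrieval systems",
    2: "best match retrieval using term weighting",
    3: "slovene morphology",
}
print(best_match({"retrieval", "weighting"}, collection))
# best-matching document first: [(2, 2), (1, 1)]
```

Note that document 1, which matches only one query term, is still retrieved and ranked below document 2; a Boolean AND of the same two terms would have rejected it outright.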
There is no doubt that with developments in hardware and software technology, together with the trend of increasing use of textual databases, both areas of research will become very attractive as an alternative to conventional, Boolean information retrieval. This has already been confirmed by the implementation of non-conventional information retrieval techniques in commercial retrieval systems, as for example STATUS/IQ (Pape and Jones, 1988). Moreover, there are some indicators (Wade et al., 1988) that the best results in information retrieval can be achieved by the integration of knowledge-based and statistically-based retrieval techniques. However, tests on a much larger scale are needed to identify the optimum way of integrating the two approaches.

Therefore, this chapter will describe the statistically-based approach to information retrieval, which has been in existence for more than two decades and which is beginning to show successful results in various operational systems. In the first section, a comparison between manual and automatic indexing will be outlined, followed by a description of automatic term selection and word conflation. The second section will be concerned with best-match searching as an alternative to Boolean searching. In the third section the main emphasis will be on describing search term weighting models. In the last section, some other areas of advanced research in information retrieval will also be briefly outlined.

1.2 Automatic indexing

1.2.1 Comparison between manual and automatic indexing

It is generally accepted that of all the operations required in information retrieval, the most crucial and probably the most difficult one consists of assigning appropriate terms and identifiers capable of representing the content of documents in a particular database.
These terms, known also as descriptors, indexing terms, or keywords, act as secondary keys for the retrieval of those documents which appear most similar to the given query formulations. Indexing can be performed either manually by trained experts or automatically by computers. The first predictions about the future of content analysis tasks were in favour of manual indexing:

"It is very likely that manual indexing (content analysis) by cheap clerical labor will still, on average, be qualitatively superior to any kind of automatic indexing. ... neither the assignment of topic terms to a given request, nor the reformulation of a request are processes which could conceivably be adequately mechanized, contrary to some speculation in this direction" (Bar Hillel, 1962; cf. Salton, 1986, p. 1).

Although the perception about "cheap clerical labour" has drastically changed since such predictions, indexing is still carried out manually by trained indexers. To this day, manual indexing is the rule rather than the exception in most operational environments.

In many manual indexing situations, where trained personnel are involved, the use of a controlled indexing language is preferred (i.e., a single standard term or phrase represents a wide variety of related terms and descriptions). This means that a variety of aids are made available to the indexer to control the indexing process, including for example a thesaurus that contains lists of equivalent and related terms for each standard thesaurus entry, or hierarchical dictionaries that contain general term arrangements capable of identifying broader and narrower terms for the various dictionary entries. In addition, to obtain quality, accuracy, and consistency of performance in manual indexing, considerable demands are placed on the indexing personnel.
It is expected that trained indexers should not only be knowledgeable about the subject matter of the database, but should also be familiar with the available indexing vocabularies and practices (e.g., number of terms, thesaural relationships). Furthermore, the performance of the various indexers should be sufficiently consistent to guarantee that similar documents are identified by comparable indexing entries. Thus, a great deal of training, experience and knowledge is required from trained indexers.

If the above demands are met, it is possible in principle to generate very useful manual indexing products. However, as said by Salton (1986), the practice of manual indexing is often different from the theory. The results of an investigation into various aspects of the storage and retrieval process (Cleverdon, 1984) indicated that:

• if two people or groups of people construct a thesaurus in a given subject area, only 60% of the index terms may be common to both thesauri;

• if two experienced indexers index a given document using a given thesaurus, only 30% of the index terms may be common to the two sets of terms;

• if two intermediaries search the same question on the same database on the same host, only 40% of the output may be common to both searches;

• if two scientists or engineers are asked to judge the relevance of a given set of documents to a given question, the area of agreement may not exceed 60%.

Thus, even if the indexing process were carried out accurately, and at the right level of detail, it is actually impossible to perform the indexing procedure consistently since more than one indexer will necessarily be needed in practice. This inevitable inconsistency affects retrieval performance and therefore leads to doubts about the potential advantages of strictly controlled, manually applied indexing languages.
The disadvantages and limitations of manual indexing have led to an increasing interest in an alternative approach to content analysis, i.e., in automatic indexing, where the selection of content identifiers is carried out with the aid of computing equipment. In automatic indexing, the problems caused by the use of a controlled language thesaurus and variations in indexing are avoided; in addition, there is also the possibility of natural-language searching in a document collection. Research into automatic indexing has resulted in a wide range of techniques, which have been implemented in either experimental or operational retrieval systems. Tests on these systems have shown that simple automatic indexing methods are fast and inexpensive, and produce a recall (i.e., proportion of relevant material retrieved) and precision (i.e., proportion of retrieved material actually relevant) performance at least equivalent to that obtainable in a manual, controlled term environment (Salton and McGill, 1983).

Results of research work have also led, according to Willett (1988a), to a general agreement that an automatic indexing system should consist of the following components:

• a term selection module, which is responsible for the selection of descriptors on the basis of a text analysis of a document;

• a conflation procedure, which is used to reduce variants of a word to a single canonical form;

• a weighting mechanism, which assigns measures of relative importance to the words which have been selected as document identifiers.

At this point it is necessary to emphasize that a term weighting mechanism can be used either for the automatic selection of indexing terms or for weighting of query terms at search time. Since more useful results have been obtained by the implementation of different weighting schemes in the latter approach, a term weighting mechanism will be outlined in connection with best-match searching.
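The conflation procedure among the components listed above can be illustrated by a toy suffix-stripping stemmer. The suffix list and length threshold below are invented English examples for illustration only; they are not the Slovene suffix list or algorithm developed later in this thesis:

```python
# Illustrative suffix list, tried longest-first (an assumption for this
# sketch, not a suffix list taken from any operational stemmer)
SUFFIXES = sorted(["ations", "ation", "ing", "ers", "ies", "ed", "es", "s"],
                  key=len, reverse=True)

def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# The three word variants conflate to the single canonical form "comput"
for word in ["computations", "computing", "computers"]:
    print(word, "->", stem(word))
```

The minimum-stem-length condition is a crude stand-in for the conditional rules discussed in Chapter 2; without it, short words such as "is" would be mangled by the final "s" rule.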
Thus, only a term selection module and a conflation procedure will be described in this section.

1.2.2 Statistical approach to automatic term selection

It has already been said that the indexing task consists of assigning to each stored item terms capable of representing document content. The first and most obvious place where appropriate descriptors might be found is the text of the documents themselves, or the text of document titles and abstracts. Thus, this section is concerned with methods for the automatic extraction of content terms from documents and document excerpts.

In automatic indexing, there have historically been two main approaches to term selection:

• the linguistic approach, based on semantic and syntactic theories of automatic indexing;

• the statistical approach, based on the concept of word frequencies.

Linguistic techniques, very popular in the 1960s, have not generally met expectations in automatic indexing research (Willett, 1988a). Although they were seen as quite a desirable computing technique, their integration into information retrieval did not prove to be as easy as was initially hoped. The idea of integration was dropped because the complexity of natural language, the complexity of its processing, and the complexity of natural language texts in general were grossly underestimated (Smeaton, 1990). Although a great amount of research work had been put, for example, into the development of sophisticated phrase analysis methods, it has been realized that simple keyword extraction techniques perform consistently better (Salton, 1986). As a result of this, researchers turned their attention to statistically-based methods for researching information retrieval processes.
Nevertheless, the latest results of research in this area—automatic translation between natural languages, natural language interfaces to databases, etc.—have indicated that this situation is changing; there is again a lot of work on re-applying techniques for automatic natural language processing to information retrieval problems. This research has concentrated mainly on the semantic and syntactic construction of term phrases. It is known that the assignment of term phrases—rather than single terms—is helpful in providing narrow, specific identifiers when the original indexing vocabulary is too broad. An example of the linguistic approach is the study by Sparck Jones and Tait (1984), which involved the use of a natural language parser to generate grammatically acceptable noun phrases from sentence-length natural language queries. These phrases can then be searched for in document abstracts. However, there is still one problem to be solved with phrase assignment: there is no easy way of generating only the useful identifiers and rejecting the useless ones (Salton, 1988). Since a reliable discrimination method is not currently available, a lot of work has still to be carried out in this area, as indicated by Fagan (1989). Recently, an interesting experiment has been described by Keen (1991a), who tested the use of term position information in non-Boolean retrieval systems. The best performance results were achieved by computing proximate matching term pairs in sentences plus a distance component. This supports the idea of the further development of term position devices for use in retrieval experiments. As pointed out above, unsatisfactory results in the linguistic approach to automatic indexing in the 1960s have strengthened research interest in the use of statistically-based techniques for automatic term selection.
A starting-point for this research is based on the hypothesis that the frequency of occurrence of distinct words in a natural language text is correlated with the importance of these words for the content representation. Specifically, if all words were to occur randomly across the documents of a collection with equal frequencies, it would be impossible to distinguish between them using quantitative criteria. One of the first to suggest that words occur in natural language text unevenly, and therefore that classes of words are distinguishable by their occurrence frequencies, was Luhn (1957):

"A notion occurring at least twice in the same paragraph would be considered a major notion; a notion which occurs also in the immediately preceding or succeeding paragraph would be considered a major notion even though it appears only once in the paragraph under consideration. Notations for major notions as just defined would then be listed in some standard order as representative of that paragraph" (cf. Salton, 1986, p. 1).

Using the constant rank-frequency law of Zipf, based on a general "principle of least effort" (i.e., an author tends to repeat certain words instead of coining new and different words; the most frequent words tend to be short function words, e.g., AND, OF, BUT), Luhn suggested that a term selection procedure should be based upon the collection frequency of each keyword. Thus, he introduced the concept of the so-called "resolving power" of the index words extracted from document texts, which should identify relevant items and distinguish them from non-relevant documents in a collection. Luhn (see Salton and McGill, 1983) emphasized that high-frequency terms are often non-specific and unable to discriminate sufficiently between relevant and non-relevant documents; very low-frequency terms can be good indicators of document relevance, but they contribute relatively little to the retrieval activity since they are most unlikely to be specified in a query.
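Luhn's frequency-based criterion can be illustrated with a short sketch: count word occurrences across a collection and keep only the terms falling in a medium-frequency band. The toy documents and the band thresholds below are assumptions made purely for illustration, not data from this thesis.

```python
from collections import Counter

# A toy corpus; the documents and frequency thresholds are illustrative.
documents = [
    "retrieval of bibliographic records and retrieval of text",
    "automatic indexing of text and automatic retrieval",
    "statistical analysis of word frequencies in text",
]

# Count how often each word occurs across the whole collection.
freq = Counter(word for doc in documents for word in doc.split())

# Luhn's "resolving power": discard very frequent words (non-specific)
# and very rare ones (unlikely to be specified in a query); keep the
# middle band.
low, high = 2, 3
mid_frequency_terms = sorted(w for w, f in freq.items() if low <= f <= high)
print(mid_frequency_terms)
```

Note that with so small a sample the function word "and" survives in the middle band, which is exactly the kind of limitation of the simple collection frequency approach discussed below.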
The concept of "resolving power" is therefore based on those index terms that are neither too rare nor too common, i.e., having intermediate frequencies of occurrence. Experiments following these original ideas have shown that they are not usable as stated in a practical operational retrieval environment. Salton and McGill (1983) pointed out the following limitations of the simple collection frequency approach:

• the elimination of high-frequency words may produce losses in recall;

• on the contrary, the elimination of low-frequency terms may produce losses in precision;

• there is no objective criterion for the selection of thresholds in order to distinguish the useful medium-frequency terms from the remainder.

In addition, as noted by Salton (1986), words occurring frequently in the texts of particular documents could not be used to distinguish these documents from the remaining texts of a collection if their occurrence frequency were also high in all other available documents. The above insights were used to produce several weighting models for term selection, including a document frequency scheme, a signal-noise ratio model, and a term discrimination model. The document frequency model is based on the calculation of how frequently a particular term occurs within both the text of an individual document and a document collection. The assumption is that a good term should have a high term frequency in a particular document, but a low overall frequency in the collection. Thus, good indexing terms are those whose occurrences are restricted to a relatively small number of documents (Salton and McGill, 1983). The signal-noise ratio model (Salton and McGill, 1983) is based on data about the "concentration" of a term in the document collection. For a perfectly even distribution, when a term occurs an identical number of times in every document in a collection, the noise is maximized.
Thus, a relationship exists between noise and term specificity, because broad, non-specific terms tend to have a more even distribution across the documents of a collection, and hence high noise. In contrast to specific terms, they do not contribute to a reduction of uncertainty about the document content. This model favours terms with very low document and collection frequencies, and in a retrieval environment usually distinguishes one or two specific documents from the remainder of the collection. The third approach to the automatic selection of indexing terms is known as the term discrimination model (Salton, 1975), which measures the degree to which the use of the term will help to distinguish the documents from each other. The model suggests that the ideal retrieval environment would be a multi-dimensional index term space in which all of the documents are as far apart as possible. The discrimination abilities of different terms can then be evaluated by the change in the inter-document separations when a term is, and when a term is not, used for indexing. A good term will be one that helps to increase the separation of all of the documents, while the assignment of a poor term will tend to decrease inter-document separations, i.e., to increase space density. Results of experiments over several years (Salton and McGill, 1983) have shown surprisingly similar conclusions to those noted by Luhn (1957), that the best discriminators are medium frequency terms in the collection in which they occur. However, the results of the experiments carried out by Biru et al. (1989) indicated that medium frequency terms are not necessarily the best discriminators when relevance data is available; while these terms may include some of the best discriminators, they also include those with very poor discriminatory abilities.
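The document frequency model described above can be sketched in a few lines. The weighting formula tf x log(N/df) used here is one common instantiation of the idea that a good term has a high frequency in its own document but a low collection frequency; both the formula and the toy collection are illustrative assumptions, not the thesis's own scheme.

```python
import math
from collections import Counter

# Toy collection; the documents are illustrative assumptions.
docs = [
    "slovene stemming algorithm for slovene text",
    "best match searching in text retrieval",
    "boolean retrieval and query formulation",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: the number of documents containing each term.
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def weight(term, doc_tokens):
    """High within-document frequency combined with low collection
    frequency gives a high weight (a tf x idf style scheme)."""
    tf = doc_tokens.count(term)
    return tf * math.log(N / df[term])

# "slovene" occurs twice in document 0 but in only one document overall,
# so it outweighs "text", which appears in two of the three documents.
print(weight("slovene", tokenized[0]) > weight("text", tokenized[0]))
```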
Although some other models for term selection are reported in the literature (Salton and McGill, 1983), a lack of general agreement about the most appropriate strategy is more than evident. While the term discrimination model suggests the use of indexing words with a medium frequency of occurrence and the signal-noise ratio model proposes words with low document and collection frequencies, the document frequency model defines as the most useful those words with high document frequency but low collection frequency. A failure to find the most appropriate means for the automatic selection of indexing terms has consequently led to the current trend of using all of the keywords from a document or query text, and then refining them with an appropriate weighting scheme at search time (Willett, 1988a). The only exceptions are some of the very high frequency terms which are eliminated by means of a stop-word list. This list, also known as a negative dictionary, usually contains so-called non-content bearing words, i.e., function words (AND, OR, WITH, FOR, etc.), words from phrases that happen to be a part of a query (I WOULD LIKE, HAVE YOU GOT, etc.), and also specialty words in the particular databases (e.g., LIBRARIES, in a database containing documents about librarianship). Although in the English language these stop-word lists usually contain about 300 common words, it is said by Salton and McGill (1983) that these terms comprise 40 to 50% of the text words; this is owing to the hyperbolic character of the Zipf law relationship. Thus, elimination of these terms increases the separation of all of the documents and also contributes to a reduction in the dictionary size of the inverted file. It should be stressed that the automatic selection of indexing terms consists merely of single words.
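Stop-word elimination as just described amounts to filtering tokens against a negative dictionary. The tiny list below is purely illustrative; real lists (such as the 1,593-word Slovene list developed later in this thesis) are far larger.

```python
# A tiny illustrative stop-word list covering function words and
# query-phrase words; real negative dictionaries are much larger.
stop_words = {"and", "or", "with", "for", "i", "would", "like", "the", "of"}

def remove_stop_words(text):
    """Keep only content-bearing words from a query or document text."""
    return [w for w in text.lower().split() if w not in stop_words]

print(remove_stop_words("I would like documents about indexing and retrieval"))
```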
Starting from the notion that single term sets extracted from documents can offer only a simplified picture of the actual text content, suggestions for more refined content identifiers have also been put forward. On the one hand, in order to improve the recall performance of single terms (particularly those with low frequencies), the possibility of adding related terms derived from a thesaurus and word association maps has been under investigation for a long time (Salton, 1986). On the other hand, in order to improve the precision performance of single terms (particularly those with high frequencies), the idea of identifying term phrases has also been seriously considered (Salton and McGill, 1983). Despite many experiments, the expectations of research in these two areas have not been met (Salton, 1988). It is known, for example, that the manual construction of a thesaurus is an enormous task and difficult to implement in operational environments in some subject areas. In addition, results of experiments in the automatic creation of a thesaurus have shown that only 20% of automatically derived associations between pairs of terms are semantically valid (Salton, 1986). Finally, the current experimental results on the automatic identification of term phrases are not sufficient to determine whether this approach does indeed result in increases in system performance; it remains to be seen how such processing can best be employed in information retrieval systems. It follows that up to the present, single-term indexing methods have performed much better than other, sophisticated approaches to term extraction. In addition, the implementation of word conflation techniques has led to an increased recall performance for single terms.

1.2.3 Word conflation

One of the main problems encountered in automatic indexing and searching is the variation in semantically related word forms in free text.
The differences are mainly caused by the requirements of grammar in a particular language, e.g., BIBLIOGRAPHY and BIBLIOGRAPHIC in English (or BIBLIOGRAFIJA and BIBLIOGRAFSKI in Slovene). The main problem to be solved is therefore to reduce the variants of a word to a single canonical form: this process is known as conflation. Conflation is discussed in detail in Chapter 2.

1.3 Best-match searching

1.3.1 Comparison of conventional (Boolean) and best-match retrieval

Nearly all current document retrieval systems use Boolean operators (AND, OR, and NOT, possibly augmented by truncation and word proximity functions) in the searching procedure. Despite some advantages in linking chosen search terms by Boolean operators (e.g., the possibility of constructing highly specific and discriminating queries), there are some serious limitations to the conventional Boolean retrieval model. The main disadvantages, as discussed by Stibic (1980), Salton and McGill (1983), Cleverdon (1984), and Willett (1985), can be summarized as follows:

• The formulation of the query, using the Boolean operators AND, OR, and NOT, is usually a very difficult task for end-users with little experience. Simple features such as the conceptual difference between the AND and OR operators, as against the day-to-day usage of the words "and" and "or", and the fact that OR is also used in the inclusive meaning, i.e., it implies not only "either" but also "both", can cause many problems for the great majority of end-users. Thus, to access large databases or to search for complex topics, the assistance of trained intermediaries is required.

• Searchers have only a limited degree of control over the size of the output that is produced in response to any particular query. Without a detailed knowledge of a particular database, a broad query may involve the retrieval of many hundreds of documents, while too detailed a query may lead to the retrieval of no documents at all.
In both cases, a considerable amount of query reformulation is needed to obtain an appropriate volume of output. This is, of course, a very expensive and time-consuming process.

• A Boolean search results in a simple division of the particular database into two separate subsets: those records that satisfy the query and those that do not. Consequently, all items in the matching subset have an equal probability of being relevant to the searcher. Thus, in the search for relevance it is necessary to inspect the entire output list: if 100 records are retrieved, the last record seen by the searcher has an equal probability of being as relevant as the first. This means that conventional retrieval provides no mechanism for presenting output in decreasing order of probable relevance.

• In a Boolean search, there is no obvious means by which one can weight the terms in a query to reflect their relative degree of importance in the search. Boolean searching—although many studies have shown that it is very easy and beneficial to calculate weights—assumes that all terms have weights of either 1 or 0, depending upon whether they happen to be present or absent in the query.

As a result of the limitations of conventional Boolean retrieval systems, there has been an increased interest in the use of best-match searching techniques. In this search procedure—also known as nearest neighbour or ranked-output search—the set of keyword stems resulting from the query-input module is matched against the sets of stems corresponding to each of the documents in the database (Willett, 1988a). When the similarity between the query and each document is calculated—for example, the number of terms in common, as proposed by Cleverdon (1984) in the so-called coordination level search—the retrieved documents are sorted into order of descending similarity with the query.
The output from the search is a ranked list in which those documents which the program judges to be most similar to the query are at the top of the list (i.e., the "tip of the iceberg", as described by Stibic, 1980) and thus displayed first to the user. Since measures of similarity are usually based on formulae derived from probability theory, the documents at the top of the list are those with the greatest probability of being relevant to the query. Best-match retrieval is therefore also known as probabilistic retrieval (Porter and Galpin, 1988). Using best-match search techniques, many of the problems associated with Boolean retrieval can be eliminated (Willett, 1985; Willett, 1988a):

• There is no need for an expert to compose Boolean expressions for end-users. Best-match retrieval is very attractive to end-users since they need input only an unstructured list of keywords.

• Since end-users obtain a ranked list of documents, there are no problems associated with control over the size of the output. End-users can easily regulate the recall and precision searching performance. Even if there are hundreds of documents as candidates for display, a quick, precision-oriented search may involve the inspection of only the first 5 or 10 documents in the ranked list, while greater recall may be obtained by going further down the list.

• It is very easy for the system to take weighting information into account to determine the degree of similarity between the query and each of the documents.

• In addition, those weights may also be based on end-users' judgements of relevance on the retrieved documents, and relevance feedback information can then be incorporated if a second search is required.
To put the advantages of best-match searching into practice, two crucial components are required:

• an efficient nearest neighbour searching algorithm to permit the calculation of the query-document similarities;

• an effective means of weighting the terms in the query so as to reflect their relative importance in discriminating between relevant and non-relevant material in the database that is being searched.

1.3.2 Implementation of best-match searching

Comparing the advantages and disadvantages of both Boolean and best-match retrieval, it is surprising that the former is the one that predominates and the latter is still a rarity in current operational retrieval systems. According to Willett (1985), there are two main reasons why nearest neighbour searching techniques are not generally implemented in retrieval systems:

• since conventional, Boolean systems have been in use for many years, there is not a great deal of willingness on the part of both users and systems providers to develop alternative information retrieval techniques;

• the first experiments in best-match searching were based on the incorrect assumption that the matching function required the comparison of a query with each of the documents in the file in turn. Thus, the computational expense was perceived to be too high for practical implementation of best-match searching techniques.

Therefore, the main question put to researchers in advanced information retrieval has been how to produce a ranked output without the need to scan all of the documents in a particular database. One of the main results of the experimental work over many years has been the conclusion that the inverted file organization which forms the basis for most current on-line Boolean retrieval systems may also be used for the implementation of best-match searching.
As noted by Willett (1988a), two main groups of best-match searching algorithms have been developed for the inverted file organization, using the characteristics of databases as a criterion:

• algorithms implemented on external databases and therefore constrained by the facilities of the on-line host retrieval system; nevertheless, there is an increasing interest in developing so-called combined searching techniques (i.e., hybrid search) which, for example, enable ranking of the output from a Boolean search (see, for example, Salton et al., 1983);

• algorithms implemented on internal, in-house databases and therefore much more flexible.

The latter group has been more extensively studied—a review is given by Perry and Willett (1983)—and is also briefly described below. The principal problem that must be addressed by the best-match searching algorithm is the identification of terms in common between the query and a document, using the inverted file organization. So far, several algorithms have been developed, and the most efficient seems to be that due to Noreault et al. (1977), which involves the addition of query lists. Experiments have shown that, despite the fact that large numbers of coefficients are evaluated, this algorithm provides an extremely efficient means of obtaining information about the number of keys in common between the query and each of the documents (Perry and Willett, 1983). The addition of the query lists may be achieved by taking the following steps:

1. query lists are processed in sequence: when a document number (identifier) is encountered for the first time in a query list, a counter is allocated to the document and set to one;

2. this counter is incremented by one each time that the document is encountered in subsequent query lists;

3. when all of the lists have been processed, each of the counters will contain the number of terms common to the query and to the appropriate document.
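The three steps above can be sketched directly, holding the inverted file as a mapping from each query term to its posting list of document identifiers. The postings data is invented for illustration.

```python
# Inverted file: each term maps to the list of document identifiers
# in which it occurs (illustrative data).
inverted_file = {
    "stemming":  [1, 3, 4],
    "slovene":   [1, 4, 7],
    "retrieval": [2, 4],
}
query = ["stemming", "slovene", "retrieval"]

# Steps 1-3: process the query lists in sequence, allocating a counter
# to a document the first time it is seen and incrementing that counter
# on every subsequent occurrence.
counters = {}
for term in query:
    for doc_id in inverted_file[term]:
        counters[doc_id] = counters.get(doc_id, 0) + 1

# Each counter now holds the number of terms common to the query and
# the corresponding document; only the query lists were touched, never
# the full document file.
print(sorted(counters.items()))
```

Incrementing each counter by a term weight instead of by one yields the weighted refinement noted in the literature.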
In the literature, the following four advantages of this search algorithm have been reported (Perry and Willett, 1983; Willett, 1988a):

• despite the fact that large numbers of coefficients are evaluated, this procedure is very fast in operation, since the set of query terms is inspected only once in order to determine which inverted file lists should be used;

• the calculation of all of the similarities involves disc access only for the query lists;

• given the number of terms in common between the query and each of the documents, it is easy to evaluate the corresponding similarity coefficient and rank the documents in order of decreasing similarity with the query;

• this procedure can be refined by the weighting of query terms, i.e., by incrementing the counters by the weight of each query term rather than by one. In this case, each counter will contain at the end of processing the sum of the weights for those terms that are common to the corresponding document and to the query.

The results of experiments in best-match searching have shown that retrieval effectiveness can be improved by incorporating procedures for the weighting of search terms.

1.4 Weighting of search terms

At the heart of nearest neighbour searching is the possibility of discrimination among the documents of a collection. A ranked output can then be achieved by using some quantitative measure of similarity between the query and each of the documents in the collection. As stated by Willett (1988a), the similarity measure consists of two main integral components:

• the term weighting scheme, as a means of assigning weights to each of the index terms in a query or a document to demonstrate their relative importance;

• the similarity coefficient, which uses these weights to calculate the overall degree of similarity between a query and each of the documents in the file.
Results of experimental work have shown that the term weighting scheme plays a more important role in the effectiveness of document retrieval systems than the choice of a similarity coefficient. There have been many different approaches to the calculation of the similarity coefficient (Salton and McGill, 1983). Particular attention has been given to the so-called component-by-component vector product, consisting of the sum of the products of corresponding term weights for two vectors. When some term is absent from the query or the document, then this term will not make any contribution to the similarity coefficient, i.e., the number of matching properties for the two vectors is reduced. Since it is customary to ensure that the similarity coefficient remains within certain bounds, say between 0 and 1, the so-called cosine coefficient (Salton and McGill, 1983) was found very easy to compute and also appeared to be very effective in retrieval. As has already been emphasized, term weighting schemes are of particular importance for retrieving documents in strictly ranked order. Term weighting schemes may be used either to weight query terms, or document terms, or both. The effectiveness of advanced information retrieval systems has been significantly improved by concentrating on developing methods for the weighting of query terms, with the documents being characterized by binary, i.e., present or absent, indexing terms. However, there is still an interest in the question of whether the terms in documents should also be weighted (Salton and McGill, 1983; Salton, 1986). It seems that some additional experimental studies will be needed to confirm the usefulness of such an approach in increasing retrieval effectiveness. Therefore, methods for weighting query terms will briefly be outlined in this section.
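The vector product and its cosine normalization can be sketched as follows; dividing the inner product by the product of the two vector lengths keeps the coefficient between 0 and 1 for non-negative weights. The term-weight vectors are invented for illustration.

```python
import math

def cosine(query, document):
    """Cosine coefficient: the component-by-component vector product
    divided by the product of the vector lengths."""
    inner = sum(q * d for q, d in zip(query, document))
    norm = (math.sqrt(sum(q * q for q in query))
            * math.sqrt(sum(d * d for d in document)))
    return inner / norm if norm else 0.0

# Illustrative term-weight vectors over the same five-term vocabulary;
# a zero component means the term is absent and contributes nothing.
q  = [1.0, 0.0, 2.0, 0.0, 1.0]
d1 = [1.0, 1.0, 2.0, 0.0, 0.0]
d2 = [0.0, 3.0, 0.0, 1.0, 0.0]
print(cosine(q, d1) > cosine(q, d2))  # d1 shares more weighted terms with q
```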
When tracing developments in research work on term weighting schemes, starting from the first empirical studies to the present firm theoretical basis, the following concepts are of particular importance:

• the concept of collection frequency, or inverse document frequency (Sparck Jones, 1972);

• the concept of a term relevance weighting system, based on probability theory (Robertson and Sparck Jones, 1976), which enables a relevance feedback search;

• the concept of the use of relevance weights when no relevance information is available, i.e., when the initial search is carried out (Croft and Harper, 1979);

• the concept which shows the close relationship between inverse document frequency weights and relevance weights (Robertson, 1986).

The concept of inverse document frequency (IDF), or collection frequency (Sparck Jones, 1972), derives from the hypothesis that matches on non-frequent terms are more valuable than ones on frequent terms. This idea is based on the model of term specificity, which states that very frequently occurring terms are responsible for noise in retrieval. Since the main problem in retrieval is to select a few relevant documents from many non-relevant ones, in this scheme the search terms are given a weight inversely proportional to their collection frequency. The matching value of a term is thus correlated with its specificity, and the retrieval level of a document is determined by the sum of the values of its matching items. Experimental studies using the IDF weighting scheme—which can be implemented by extremely simple means—have given results which are superior for best-match searching to those resulting from the use of unweighted query terms (e.g., coordination level searching, where no distinction is made between frequent and non-frequent terms in common). Many subsequent tests have also confirmed that this scheme, although in principle based on a very simple approach, provides very effective results.
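The IDF idea can be sketched in a few lines. The common formulation log(N/n_i), where n_i is the number of documents containing term i, is assumed here; the collection figures are invented for illustration.

```python
import math

# n_i: the number of documents containing each term, out of N documents
# in the collection (illustrative figures).
N = 1000
document_frequency = {"computer": 400, "stemming": 20, "slovene": 5}

def idf(term):
    """Weight inversely related to collection frequency: a match on a
    rare, specific term counts for more than one on a frequent term."""
    return math.log(N / document_frequency[term])

for term in ("computer", "stemming", "slovene"):
    print(term, round(idf(term), 2))
```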
The IDF weighting scheme is based on information about the distribution of terms in documents in a particular collection. Thus, as stated by Willett (1988a), the IDF weight is characteristic of a particular term, and is not specific to a particular request. In the search for improved information retrieval, the possibility of including relevance information in the weight assigned to a query term has also been investigated. As a result, a term relevance weighting scheme has been proposed (Robertson and Sparck Jones, 1976) which reflects the degree to which the query term can discriminate between relevant and non-relevant documents. It is suggested that terms occurring predominantly in relevant documents should be assigned greater weights than those terms occurring predominantly in non-relevant documents. The use of relevance information as a means for the weighting of query terms is based on probability theory. Thus, the assumption is made that index terms occur independently in the relevant and non-relevant documents. In other words, if the probability of the $i$th term occurring in a document, given that the document is relevant, is $p_i$, and the corresponding probability for a non-relevant document is $q_i$, then the weight for the term should be

    $w_i = \log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}$

If it is known exactly which documents in the collection are relevant and which are not, this weight can be calculated using the following formula:

    $w_i = \log \frac{r_i(N - n_i - R + r_i)}{(n_i - r_i)(R - r_i)}$

where:

$N$ = the number of documents in the collection
$n_i$ = the number of documents indexed by the given term $i$
$R$ = the number of relevant documents for some query
$r_i$ = the number of relevant documents indexed by $i$

The theory that leads to this weighting scheme also results in the specification of a similarity coefficient which corresponds to the sum of the weights for those query terms which occur within a document.
Thus, if $d_i$ denotes the presence ($d_i = 1$) or absence ($d_i = 0$) of the $i$th query term in a document, the matching function used is

    $\sum_i d_i \log \frac{p_i(1 - q_i)}{q_i(1 - p_i)}$

where the summation is over all of the terms in the query. When full relevance information is available, i.e., where $r_i$ and $R$ are known for each term and query, these weights have been shown to give excellent retrieval performance. For example, the concept of the relevance feedback search, which is based on previously supplied user judgements on the relevance of displayed documents, is a very attractive tool for the new generation of free-text retrieval systems. In the normal course of events, relevance information is not available. Such a situation, usually during the initial search, when the user has not yet had a chance to provide any relevance information to the system, has been considered by Croft and Harper (1979), also using the probabilistic retrieval model. They suggested that $p_i$ and $q_i$ (as defined by Robertson and Sparck Jones, 1976) should be estimated as follows:

• $p_i$ should be assumed constant; the hypothesis is that all of the query terms have equal probabilities of occurring in relevant documents. Thus,

    $\frac{p_i}{1 - p_i} = C$

where $C$ is a constant;

• $q_i$ should be taken as the proportion of documents in the whole collection that contain the term; the assumption is that the occurrence of a term in a non-relevant document may be approximated by its occurrence in the entire collection. Thus,

    $q_i = \frac{n_i}{N}$

Under these circumstances, the weight of the term should be

    $w_i = \log C + \log \frac{N - n_i}{n_i}$

Substituting into the relevance weight expression above gives

    $\sum_i d_i \left( \log C + \log \frac{N - n_i}{n_i} \right)$

which may be expressed as

    $\log C \sum_i d_i + \sum_i d_i \log \frac{N - n_i}{n_i}$

where the summation is again over all of the terms in the query. As can be seen from the above expression, this probabilistic model consists of two parts, and is therefore also known as the combination match. The first part corresponds to a simple coordination level match, i.e., the number of common terms between the query and the document, multiplied by the constant $\log C$.
Since it is reasonable to suppose that the significant words are those with potentially high probabilities of occurring in the relevant documents, i.e., almost equal to 1.0, Croft and Harper (1979) suggested on the basis of the results of experiments that $p_i$ should be 0.9 (giving a value of 9.0 for $C$). The second part of the expression is almost identical to the IDF weight, as described by Sparck Jones (1972). Given the sum of the weights for those query terms which occur in each document, a ranked list of documents in order of decreasing similarity to the query is displayed to the user, who can thereafter perform a relevance feedback search, giving relevance judgements for each displayed document. This second search should reflect user requirements more closely than the initial search. Finally, an important contribution to the theoretical basis of term weighting schemes can be found in a recent article by Robertson (1986). Using again the probabilistic retrieval model, a comparison of the so-called point-5 relevance formula (i.e., a slightly modified formula of Robertson and Sparck Jones, 1976) and the formula of Croft and Harper (1979) demonstrated a very close relationship between IDF weights and relevance weights. On the grounds of their successful performance, different term weighting schemes—mainly based on a probabilistic retrieval model—are implemented in current operational best-match retrieval systems, which are therefore also known as probabilistic retrieval systems.
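Both weighting schemes discussed above can be sketched in a few lines: the Robertson/Sparck Jones weight for use when full relevance information is known, and the Croft/Harper weight for the initial search. The counts used in the example are invented for illustration.

```python
import math

def relevance_weight(N, n_i, R, r_i):
    """Robertson/Sparck Jones relevance weight, applicable when full
    relevance information (R relevant documents, r_i of them indexed
    by the term) is available."""
    return math.log((r_i * (N - n_i - R + r_i)) / ((n_i - r_i) * (R - r_i)))

def combination_match_weight(N, n_i, p_i=0.9):
    """Croft/Harper weight for the initial search, with no relevance
    information: a constant part log C plus an IDF-like part."""
    C = p_i / (1 - p_i)  # p_i = 0.9 gives C = 9.0
    return math.log(C) + math.log((N - n_i) / n_i)

# Illustrative figures: 1000 documents, 50 containing the term,
# 10 documents relevant to the query, 8 of those containing the term.
print(round(relevance_weight(1000, 50, 10, 8), 2))
print(round(combination_match_weight(1000, 50), 2))
```

Summing these weights over the query terms present in each document gives the matching function described in the text.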
1.5 Conclusions

Besides automatic indexing, best-match retrieval and term weighting, which form the nucleus of advanced document retrieval systems, there are also the following very active areas of research by which the development of these systems may be affected in the future:

• cluster analysis, or automatic classification (McCall and Willett, 1986; Griffiths et al., 1986; Willett, 1988b);

• the knowledge-based approach, i.e., automatic natural language processing and expert intermediary systems, as already described in this chapter;

• serial text searching, based on the use of text signatures and parallel processing (Pogue and Willett, 1984; Carroll et al., 1988; Willett, 1988a).

Research work in advanced information retrieval is now beginning to be reflected in operational systems of various sorts. These systems have already been shown to be able:

• to retrieve larger amounts of relevant material than conventional systems;

• to replace trained intermediaries by end-users with limited experience of the search process.

One of these systems is INSTRUCT (INteractive System for Teaching Retrieval Using Computational Techniques), which was initially designed at the Department of Information Studies, University of Sheffield, for demonstrating advanced information retrieval techniques to students of librarianship and information science, but is becoming a useful basis for testing a range of research problems in information retrieval. The processing routines in INSTRUCT are, in very large part, independent of the actual language in which the texts have been written. The only exception is the stemming algorithm, which has to take account of the morphological structure of the particular language. The original version of INSTRUCT is thus based on the stemming algorithm developed for the English language by Porter (1980).
Since the use of nearest neighbour searching techniques for providing end-user access to databases of Slovene text will be tested by the employment of the INSTRUCT package, there was a need for an appropriate stop-word list and stemming algorithm. A review of stemming algorithms is given in the next chapter. On the basis of this review, along with the morphological characteristics of the Slovene language, the most suitable techniques for the development and design of the Slovene stemming algorithm will be selected.

Chapter 2

Automatic Word Conflation

2.1 Introduction

When designing retrieval systems that are both more effective and more efficient, it is necessary to develop techniques which are able to cope with the morphological variations of natural language vocabulary. This is particularly important for reference retrieval systems, where documents are usually described by the words in the document title, keywords, and possibly by the words in the document abstract. A failure to account for the morphological complexity of terms can cause a substantial decrease in information retrieval performance. The variations in words are mainly caused by the requirements of grammar in a particular language, e.g., MORPHOLOGY and MORPHOLOGICAL, by national usage, e.g., differences between American and British spelling (LABOR, LABOUR), and by mis-spellings. However, in many languages, including English and Slovene, terms with a common stem will usually have similar meanings, as for example:

Slovene          English
BIBLIOGRAFIJA    BIBLIOGRAPHY
BIBLIOGRAFIJE    BIBLIOGRAPHIES
BIBLIOGRAFSKI    BIBLIOGRAPHIC

Consequently, the performance of an information retrieval system can be improved if these word variants are reduced to a single canonical form, without altering their meaning.
This may be done by the removal of the various suffixes -Y, -IES, and -IC in English to leave the single stem BIBLIOGRAPH, and similarly by the removal of the various endings -IJA, -IJE, and -SKI in Slovene to leave the single stem BIBLIOGRAF. A procedure which uses systematic abbreviation of words so as to bring together words which are morphologically related, in the hope that they will also be semantically related (Walker and Jones, 1987), is known as a conflation procedure.

Word conflation performs two useful functions in information retrieval systems (Willett, 1988a):

• it may reduce the total number of different words, consequently leading to a reduction in dictionary size and updating problems;

• the retrieval effectiveness, particularly the recall performance, may be increased through the identification of semantically related terms.

Recent research in information retrieval has been more concerned with performance improvement than with storage reduction (Harman, 1987). Conflation of word variants can be achieved in information retrieval either manually or automatically. An example of manual term conflation is the use of right-hand truncation at search time as specified by the searcher. Although considerable experience is needed if effective truncation is to be achieved, the majority of current on-line systems use this technique to reduce morphological differences between similar words. Since a certain amount of linguistic knowledge and training is required to perform right-hand truncation, this task must be carried out by an experienced intermediary and not by a casual end-user. Even when end-users are given the opportunity of employing truncation during searches, for example in experiments by Markey (1983), they consistently avoid the use of this facility.
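The suffix removal described above, which reduces the BIBLIOGRAPH and BIBLIOGRAF variants to a single canonical form, can be sketched as a procedure that tries the longest matching ending first. The two tiny ending lists below are illustrative only, not the actual lists developed later in this thesis:

```python
def strip_suffix(word, suffixes):
    # try the longest ending first, so that e.g. -IES is removed before -S;
    # a simple length guard prevents stripping very short words
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            return word[:-len(s)]
    return word

ENGLISH_ENDINGS = ["Y", "IES", "IC"]
SLOVENE_ENDINGS = ["IJA", "IJE", "SKI"]
```

With these lists, BIBLIOGRAPHY, BIBLIOGRAPHIES, and BIBLIOGRAPHIC all reduce to BIBLIOGRAPH, and BIBLIOGRAFIJA, BIBLIOGRAFIJE, and BIBLIOGRAFSKI all reduce to BIBLIOGRAF.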
Although manual word truncation is performed by an experienced intermediary, two major types of errors are still possible (Lovins, 1971):

• over-truncation occurs when too short a stem remains after truncation; this may result in completely unrelated terms being conflated to the same stem, as with STRATEGY and STRATIFICATION being retrieved by the stem STRAT* (similarly in Slovene, STRATEGIJA and STRATIFIKACIJA being retrieved by the stem STRAT*);

• under-truncation happens when too short a suffix is removed and may result in related words being described by different stems, as with COMPUTERS being truncated to COMPUTER*, rather than to COMPUT*, which would also include words such as COMPUTING and COMPUTATION (similarly in Slovene, RAČUNALNIKI being truncated to RAČUNALNIK*, rather than to RAČUNAL*, which would also include words such as RAČUNALNIŠTVO).

Both types of error can significantly decrease retrieval performance, the former reducing the precision of the search, the latter decreasing recall. For this reason, major bibliographic database vendors provide devices for intelligent review and selection of candidate terms from the dictionary file. Usually, an experienced intermediary performs a truncation and then adds (using Boolean OR) to the original search term(s) selected terms from the alphabetically sorted display. The result is a term group, consisting of the original search term and its variant forms. An alternative way of bringing together semantically related word variants is the use of a conflation algorithm as part of an automatic computational procedure. The most common automatic conflation procedure is the use of a stemming algorithm which,

"...reduces all words with the same root (or, if prefixes are left untouched, the same stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes" (Lovins, 1968, p. 22).
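The two error types can be demonstrated with a simple prefix-matching sketch of right-hand truncation; the vocabulary below is illustrative only:

```python
def truncation_matches(pattern, vocabulary):
    # right-hand truncation: "STRAT*" matches every word beginning STRAT
    prefix = pattern.rstrip("*")
    return sorted(word for word in vocabulary if word.startswith(prefix))

VOCAB = ["STRATEGY", "STRATIFICATION",
         "COMPUTERS", "COMPUTING", "COMPUTATION"]
```

Here STRAT* conflates the unrelated STRATEGY and STRATIFICATION (over-truncation, harming precision), while COMPUTER* retrieves only COMPUTERS and misses COMPUTING and COMPUTATION (under-truncation, harming recall); COMPUT* retrieves all three related words.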
In most cases prefixes are not removed because they tend to have a more drastic effect on the meaning of a word than do suffixes; in addition, algorithmic techniques for the removal of prefixes are less well studied. The stemming algorithm performs, therefore, a similar function to that of manual right-hand truncation. The obvious question, in terms of retrieval effectiveness, relates to the difference between the use of manual right-hand truncation and automatic word conflation. This question was addressed by Frakes (1984), who found no significant difference between these two conflation procedures. Thus, he concluded, word conflation can be automated in retrieval systems with no average loss of performance, consequently allowing easier end-user access to the system. As a result of research work in automatic word conflation there are several systems which are based on stemming algorithms, for example CITE (Ulmschneider and Doszkocs, 1983), MASQUERADE (Brzozowski, 1983), MARS (Niedermair et al., 1985), INSTRUCT (Hendry et al., 1986a,b), and OKAPI (Walker and Jones, 1987).

Since the model of a Slovene language-based free-text retrieval system will be developed using INSTRUCT (Hendry et al., 1986a,b), it is necessary to emphasize that the operation of INSTRUCT is language-independent, with one exception, i.e., the stemming algorithm. In this chapter, therefore, both a theoretical background of automatic word conflation and a methodological framework for the design of a stemming algorithm for Slovene will be outlined. This will provide a basis for the experimental work that is needed to develop a stemming algorithm for the Slovene language. This chapter begins with a description of the main characteristics and types of conflation algorithms. This will be followed by a review of some of the conflation techniques which have been developed mainly for use in information retrieval.
However, since word stemming has also found application in the areas of natural language processing and computational linguistics, some of these techniques (e.g., morphological analysis of terms, as described by Cercone, 1978) will also be described as potentially interesting for information retrieval. In the third section some methods and results of the evaluation of conflation algorithms will be presented. The final section will serve as a starting point for the design of a stemming algorithm for the Slovene language.

2.2 Characteristics of stemming algorithms

Stemming algorithms which have been developed over the last two decades differ in many ways. The following classification is an attempt to capture the main features of automatic word conflation.

2.2.1 Types of stemming algorithms

According to Ulmschneider and Doszkocs (1983), stemming algorithms can be broadly divided into two classes: stemming purely by morphological analysis of terms, and stemming by the application of suffix dictionaries. Purely morphological techniques are characterized by the removal of suffixes from words according to their internal structure. The algorithm analyses the morphology of the word string and, guided by rules of term morphology, determines candidate locations in the string marking the boundaries of a suffix. The optimal suffix candidate (often the longest) is then removed or replaced. For example, the dependency of any letter's appearance in a word upon the letters preceding or succeeding its position can be exploited to determine the boundaries of word units (Hafer and Weiss, 1974). Although morphological analysis (see also Cercone, 1978; Niedermair et al., 1985) has the potential for detecting both prefixes and suffixes, it is rarely used, firstly because of the difficulty of deriving comprehensive and reliable rules (Ulmschneider and Doszkocs, 1983), and secondly because a substantial amount of processing is usually required (Lennon et al., 1981).
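The letter-dependency idea attributed to Hafer and Weiss above can be sketched by counting successor varieties: the number of distinct letters that follow each prefix of a word across a corpus. A peak in the variety suggests a word-unit boundary. The tiny corpus below is illustrative only:

```python
def successor_variety(prefix, corpus):
    # number of distinct letters that follow this prefix in the corpus
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def varieties(word, corpus):
    # successor variety after each prefix of the word (length 1 upwards)
    return [successor_variety(word[:i], corpus) for i in range(1, len(word))]

CORPUS = ["READ", "READS", "READING", "READER", "REAL", "RED"]
```

For the word READING against this corpus, the varieties are low inside word units and peak after the prefix READ (three distinct successors: S, I, E), marking the READ/ING boundary.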
The most common approach to stemming employs a dictionary of suffixes along with rules for its use. When a word is presented for stemming, the presence of these suffixes is searched for at the right-hand end of the word. If a suffix is found to be present, and the set of rules is satisfied, it is removed or replaced by another string, using either the principle of iteration or longest-match assignment. All algorithms in this group require the construction of a suffix dictionary and the formulation of a corpus of rules defining the morphological context of the suffixes. It is this group of stemming algorithms whose characteristics are described below.

2.2.2 Compilation of a suffix list

All algorithms based on a suffix dictionary can be broadly divided into two classes, depending upon the manner of their development. The first class relies on a time-consuming manual analysis of vocabulary and language behaviour, in order to construct a suffix list and to formulate a corpus of rules. However, this effort is generally associated with conflation results of a high quality. The second class of algorithms is characterized by the automatic generation of the suffix list from bodies of text. All word endings occurring more often than some predetermined threshold are selected as suffixes. Apart from the minimal manual intervention required, this approach can be adapted to different collections (Tarry, 1978), if it is assumed that word conflation depends on the type of the source where the suffixes were found. Lennon et al. (1981) revealed that fully automated methods performed as well as procedures which involved a large degree of manual involvement in their development. They suggested that while the manual evaluation of lists of possible suffixes and rules gives results of a very high quality, the length of time taken often makes this method impractical.
Their results confirmed that the consequent reduction in implementation costs in the automatic generation of suffixes was not achieved at the expense of a decrease in conflation performance.

2.2.3 Mode of operation of stemming algorithms

Stemming algorithms can operate on the iterative principle, the longest-match principle, or on a mixture of both. Iteration is based on the fact that suffixes are attached to stems in a certain order, usually in the following form: stem - derivational suffixes - inflectional suffixes. An iterative stemming algorithm is simply a procedure which removes suffixes (sometimes single letters or strings rather than true suffixes) one at a time, starting at the end of a word and working toward its beginning. For example, Porter's iterative algorithm (Porter, 1980) processes the word GENERALIZATIONS in four iterations: in the first step GENERALIZATIONS is stripped to GENERALIZATION, then in the second step to GENERALIZE, in the third step to GENERAL, and finally to GENER. The longest-match principle states that within any given class of endings, if more than one ending provides a match, the longest should be removed in one iteration. In the above example, the longest suffix -ALIZATIONS would be removed in one step. This principle is implemented in algorithms by scanning the endings in order of decreasing length. If a match is not found on longer endings, shorter ones are scanned. Consequently, on the one hand, longest-match algorithms are often easier to program but require a much larger dictionary, since they must include all compound suffixes in each order in which they can appear. Iterative algorithms, on the other hand, while they permit the use of a much shorter list of suffixes, tend to be difficult to design, since a great many endings must be examined in the preparation of a list.
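The two modes of operation can be contrasted in a short sketch. The suffix lists are toy examples chosen to reproduce the GENERALIZATIONS illustration, and the iterative steps here are slightly coarser than Porter's actual rules:

```python
SIMPLE = ["S", "ATION", "IZ", "AL"]    # short list: one ending removed per pass
COMPOUND = SIMPLE + ["ALIZATIONS"]     # longest-match also needs compound endings

def longest_match(word, suffixes):
    # single pass: remove the longest ending that matches
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            return word[:-len(s)]
    return word

def iterative(word, suffixes):
    # repeatedly strip one ending at a time until nothing more matches
    stripped = longest_match(word, suffixes)
    while stripped != word:
        word, stripped = stripped, longest_match(stripped, suffixes)
    return word
```

Both routes reduce GENERALIZATIONS to GENER: the iterative version in four passes over a four-entry list, the longest-match version in a single pass over a list that must also contain the compound ending -ALIZATIONS.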
These advantages and disadvantages of both the longest-match and the iterative algorithms often play a crucial role in the design of a conflation procedure for information retrieval. A typical example of an algorithm based on the longest-match principle is described by Lovins (1968), while Porter's algorithm (Porter, 1980) uses the iterative principle.

2.2.4 Conditional rules

A further important feature of a stemming algorithm is whether it is context-free or context-sensitive. The latter feature requires rules to be incorporated into an algorithm. These rules prevent the removal of a given suffix or a class of suffixes if a given condition is not satisfied. Conditional rules are usually of a quantitative nature and thus involve a minimum length condition, as for example: do not remove the suffix if the resultant stem would be less than four characters long. There may also be some qualitative contextual restrictions, which usually include word-specific rules, e.g., remove -ED unless the word is UNITED. In a context-free algorithm, conversely, no quantitative or qualitative restrictions are placed on the removal of endings, and thus any ending which matches a word is accepted for stripping.

2.2.5 Recoding rules

Algorithms may also include some recoding rules which are applied after stemming has taken place. Recoding rules make changes at the end of the resultant stem to achieve the ultimate matching of varying stems. For example, the word FORGETTING might well be stemmed to FORGETT by removal of the suffix -ING. This will not match the word FORGET because of the repetition of the terminal consonant, and so a simple recoding rule is to remove one of any such doublings at the end of a stem. Other recoding rules may be used to achieve better conflation by rewriting some stems.
For example, terminal -Y may be replaced by -I to conflate word forms ending in -Y with other related words: this rule would apply to the word LIBRARY to retrieve LIBRARIES when the -ES suffix is removed. Another example is the changing of -PT to -B to conflate word forms ending in -B with grammatically related words which change the -B to -PT: this rule would apply to the word ABSORPTION to retrieve ABSORB and ABSORBENT. The construction of a comprehensive set of context-sensitive and recoding rules is one of the most difficult and time-consuming parts of the design process of a stemming algorithm. Although there is no doubt that better conflation can be achieved by using these rules, it is important to emphasize that,

"... there comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere" (Porter, 1980, p. 131).

Thus, there is always a danger of the algorithm becoming more complicated than it needs to be. One of the important aims in the design of a stemming algorithm is to obtain a balance between the number of rules, and the simplicity and efficiency of processing. As an alternative, or addition, to the use of a set of predetermined recoding rules, some authors (for example, Dawson, 1974) suggest the use of a partial matching procedure at search time. This should allow a pair of non-identical stems to be matched if they are in large part similar. ABSORB and ABSORPTION can again serve as an example, since they are similar in the first five letters. This approach is also related to the string similarity measures described by Angell et al. (1983). These measures were mainly developed as part of experiments on automatic spelling correction. Correction programs typically operate by searching in a large machine-readable dictionary for a word from the text to be corrected.
If the word is not present, then it is assumed to be mis-spelt and some procedure is adopted that converts the mis-spelt word into a word that is present in the dictionary. The numbers of n-grams common to pairs of words (Freund and Willett, 1982), or the SPEEDCOP algorithm (Pollock and Zamora, 1984), are representative of similarity-based techniques.

2.2.6 Users' needs

When developing a stemming algorithm it is also very important to consider its application area, i.e., whether it is going to be used in a specialized on-line reference retrieval system or in an on-line catalogue. This difference bears on the choice between so-called "strong" stemming, in which longer classes of endings are removed from the word, and so-called "weak" stemming, in which only plural and singular forms are conflated; the latter is also known as the "S" algorithm (Harman, 1987). Since typical queries in specialized on-line reference retrieval systems usually include at least three words, and often more, it may not matter if some of the terms contain a substantial proportion of false drops attributable to stemming. Consequently, strong stemming can increase recall in these systems without unduly decreasing precision, as is shown, for example, in experiments by Frakes (1984). In addition, users of these systems are usually interested in an exhaustive search, and are prepared to examine even a substantial proportion of irrelevant material which could be caused by the use of "strong" stemming. On the other hand, the use of a general on-line catalogue differs very much from the use of a reference retrieval system. As shown in experimental tests by Walker and Jones (1987), most search statements consist of only one or two words. Thus, even using only "weak" stemming can lead to unexpected results (e.g., compare the words RIGHT and RIGHTS).
In addition, many library users do not want an exhaustive search; they want to find one or two relevant items, and are not prepared to look at dozens of irrelevant records before they find them. On the basis of the above assumptions, the idea of a multi-level conflation system is advocated by Walker and Jones (1987), and also implemented in the OKAPI project, which is concerned with the design of an on-line public access catalogue at the Polytechnic of Central London. In the multi-level conflation procedure, the main distinction is made between weak and strong stemming. While in weak stemming only plural and singular forms are conflated, strong stemming removes longer classes of both derivational and inflectional suffixes. In addition, to avoid a good deal of noise in information retrieval, strong stems are given lower weights than weak stems, and thus such records are displayed after the records retrieved on weak stems.

2.2.7 Language dependency of a stemming algorithm

An important characteristic of a stemming algorithm is its language dependency, which includes both the national usage of a language and the use of professional terminology within individual languages. Although there are some exceptions (see, for example, Jappinen et al., 1985), the great majority of stemming algorithms are designed for English language environments, i.e., they use suffix dictionaries composed of English word endings. However, the crucial part of these algorithms consists of different methods and techniques (i.e., modes of operation, the use of conditional and recoding rules) which can be applied to any other language where the semantic significance is contained in the stems, and not in the suffixes. Accordingly, the grammatical characteristics of the individual language, particularly the level of its morphological complexity, determine the adoption of these conflation techniques.
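The "weak" or "S" stemming discussed in the previous section can be sketched as a handful of plural rules; this is a simplified reading of the idea, not Harman's exact rule set:

```python
def weak_stem(word):
    # conflate plural and singular forms only (a simplified "S" stemmer)
    if word.endswith("IES") and len(word) > 4:
        return word[:-3] + "Y"          # LIBRARIES -> LIBRARY
    if word.endswith("ES") and len(word) > 3:
        return word[:-2]                # BOXES -> BOX
    if word.endswith("S") and not word.endswith("SS") and len(word) > 3:
        return word[:-1]                # RIGHTS -> RIGHT
    return word
```

Even these minimal rules conflate RIGHT and RIGHTS, illustrating the kind of single-word catalogue query where weak stemming alone can change what a search retrieves.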
In this context, it is of particular interest to note that the German language is characterized by a large number of compound words. Thus, as noted by Fuhr (1990), stemming techniques which were designed for the English language, and are also dictionary-independent, could not produce successful results on German texts. Instead, a great amount of work has been devoted to the morphological segmentation of these compound words. For example, Wenzel (1980) has discussed the use of segmentation techniques analogous to those described by Hafer and Weiss (1974). Within any national language it is also important to consider potential variations in stemming performance across different subject areas. While most algorithms are generalized in approach, some have been developed with a special emphasis on a particular subject area. One such example is the stemming of medical English, as described by Ulmschneider and Doszkocs (1983).

2.2.8 Some other characteristics of stemming algorithms

In any suffix stripping program for information retrieval work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve information retrieval performance, and not as a linguistic exercise (Porter, 1980). It is not at all necessary that reduced morphological variants coincide with a linguistically correct root or stem, or that the character strings to be removed be linguistically accepted affixes. The main problem to be solved using stemming is to reduce the variants of a word to a single canonical form, without altering the meaning. The correctness of this approach was confirmed in experimental tests by Frakes (1984). His hypothesis was based on the linguistic theory that terms truncated on the right at the root morpheme boundary will perform better than terms truncated on the right at other points. The hypothesis was not confirmed, because small deviations from root boundaries did not significantly affect retrieval performance.
The second point to be considered in the design of a stemming algorithm is the fact that the success rate for suffix stripping is always less than 100% (Porter, 1980). In the development of a stemming algorithm, there is usually a temptation to deal with word forms which appear to be important but which are rare in most applications. An attempt to cope with such cases can result in the addition of many rules, which can lead to complexity in the program. In view of the error rate that must in any case be expected, it is not worthwhile trying to cope with such cases. As already stated, the main aim of a stemming procedure is to obtain a balance between the number of rules and the simplicity and efficiency of processing. Many of the characteristics of stemming described above are included in algorithms which have been developed over the last two decades, mainly for the English language environment. In the following section, a condensed review of some of these algorithms is given. Apart from studying the original reports, both the comparative paper by Lennon et al. (1981) and the survey by Walker and Jones (1987) form the basis of this review.

2.3 Conflation algorithms: a review

2.3.1 Lovins

One of the first conflation algorithms to be developed and tested was part of Project Intrex (Overhage and Reintjes, 1974), which was used for extensive experiments in on-line retrieval of library-type material. Word stemming was incorporated into this system to improve the effectiveness of information retrieval. Lovins (1968), who participated in this project, obtained a preliminary list of endings by examining the suffixes of a small portion of words in the Project Intrex catalogue and by studying a list of endings used at Harvard.
The preliminary list was evaluated by applying the endings to dictionaries of normal and reversed English words to see whether the removal of a given ending would result in (1) two different stems matching, or (2) a stem not matching another stem which it should match. Either of these conditions necessitated the addition of new endings, the disposal of old ones, or the addition of context-sensitive and recoding rules. This manual assessment resulted in a final list which contained about 260 endings, divided into 11 subsets; the subsets were ordered in accordance with the decreasing length of the endings and were internally alphabetized for easy handling. In this longest-match algorithm, each suffix was associated with one of 29 context-sensitive rules; there were also 34 recoding rules to cope with words such as METER and METRIC. Lovins' algorithm was also employed in some other experimental information retrieval systems, one of them being MASQUERADE (Brzozowski, 1983).

2.3.2 Dawson

Dawson's algorithm (Dawson, 1974) is based on that developed by Lovins (1968), and thus a longest-match method is used. However, in his design of the algorithm, Dawson found the initial list of about 260 suffixes to be incomplete, lacking most plurals and other combinations of simple suffixes. His inclusion of additional endings resulted in a list of about 1,200 suffixes. To avoid the problems of storage and processing time which could be created by this large suffix list, Dawson used the principle of reversing the suffixes (and word-specific suffix removal conditions) and indexing them by length and by final letter. Unlike most of the algorithms, Dawson does not use recoding, describing this process as being extremely unreliable. Instead, his algorithm works basically on a partial matching principle, i.e., words are matched if their stems are "nearly" identical. Thus, provision is made for the matching of, for example, ABSORB and ABSORPT.
This is done by having a set of standard stem endings which can be considered equivalent (e.g., -RB and -RPT, or -MIT and -MISS). Dawson included fifty of these stem ending classes, and the basic principle of his algorithm is as follows: if two stems match up to a certain number of characters, and the remaining characters of each stem belong to the same stem ending class, then the two stems are conflated to the same form. Both the extensive suffix list and the removal conditions were drawn up manually using a Key-Letter-In-Context (KLIC) index; a similar approach to that used by Field (1975).

2.3.3 RADCOL

Lowe et al. (1973) tested two stemming algorithms as part of the RADCOL project, i.e., the information storage and retrieval system developed by Informatics for the Rome Air Development Centre. The first algorithm involved two passes through a single list of 95 suffixes, but this was rejected in favour of a single-pass, longest-match algorithm containing a much longer list of 570 endings. To obtain this list, a multi-stage process was used. First, the characters of the most frequent words in the index (i.e., words occurring more than 10 times) were reversed, and the reversed words were sorted into alphabetical order. Adjacent words in the ordered list were then compared and, whenever a match of n characters was found, strings containing 1, 2, ..., n characters were written out to tape. Thus, the list would contain character strings such as -G, -NG, and -ING. These strings were sorted, cumulated, and the most frequent endings used as the starting point for a manual selection of the final suffix list. The final list was completed by examining the effect suffix adoption would have on the words in the collection and by comparison with the suffixes from Lovins' algorithm. Although the algorithm developed by Lowe et al.
(1973) uses a long list of suffixes, the longest-match procedure is simple in application, since there are only two context-sensitive and three recoding rules.

2.3.4 INSPEC

The conflation algorithm designed at INSPEC (Field, 1975) was developed for statistical studies on the frequency and growth characteristics of free language indexing vocabularies. Both the word ending lists and the associated rules were drawn up manually, utilizing a KLIC (Key-Letter-In-Context) index. The KLIC index was produced from the single index words assigned to the documents in the test database, with each word being filed under each of its constituent characters. The KLIC index was arranged in the alphabetical order of the filing letter, and included a frequency count for each word type, and also for each distinct word ending. This listing was then scanned manually to give the lists of endings and context-sensitive rules to be used in the automatic stemming procedure described below. The INSPEC algorithm is a mixture of longest-match and iterative suffix removal. Minimum stem length, recoding rules, and three-stage conflation are the main features which were designed to improve its effectiveness. Of particular interest is the three-stage conflation procedure. The first stage (Algorithm 0), which is partly iterative in character, removes very common endings, such as plural forms, and eliminates stop-listed words. Words which are not stopped are then passed to the second part of the algorithm (Algorithm 1), which carries out most of the suffix removal. In this stage, a longest-match routine is used in which each suffix has an associated set of context-sensitive rules and a minimum permissible stem length. In the final stage (Algorithm 2), the algorithm makes adjustments to the stem, usually on the basis of stem length. Field (1975) claims that the use of a three-stage process leads to a significant increase in the efficiency of word conflation.
The idea of a multi-level conflation system was adopted by Walker and Jones (1987) in their project OKAPI.

2.3.5 Automatic generation of suffix lists

Lennon et al. (1981), in the course of their evaluation of conflation algorithms, extended the method used in the RADCOL project to achieve entirely automatic generation of suffix lists. A vocabulary of reversed words was used to produce a list of word endings occurring more often than some predetermined threshold. This list can then be used in a context-free, longest-match procedure, which is also known as the frequency algorithm. There is no doubt that, on the one hand, context-free algorithms with no conditional or recoding rules are much simpler to develop and may also be more efficient at run time, since no character matching need be carried out to determine the context. On the other hand, the potential disadvantage of such a simple approach is, as stated by Lennon et al. (1981), that the inclusion of a string such as -INGNESS in a suffix set necessarily implies that all of the constituent substrings, e.g., -NGNESS and -GNESS, will also be included, and thus the proportion of potentially useful suffixes in the set is much reduced. A related method was used by Tarry (1978), who generated several sets of equifrequent character strings from the ends of words using an algorithm similar to that described by Lynch (1977) and Cooper and Lynch (1979). The algorithm is based on the variety generation technique, i.e., on the selection of character strings of variable length occurring with approximately equal frequencies, and with low sequential dependence, in a given body of text. A starting point in generating such "symbol sets" is the assumption that character strings representing suffixes will occur more frequently than other terminal character strings. In addition, it is also assumed that letter dependency within words decreases at the boundaries of word units such as suffixes.
Thus, by generating symbol sets from the backs of the words, utilizing the above assumptions, it is possible to produce a workable set of suffixes which are not necessarily correct linguistic endings. Tarry's algorithm works on the longest-match principle, using suffix lists generated by the method described above. The algorithm has no restrictions on suffix removal other than that the remaining stems should be of a minimum length of three characters. The algorithm is context-free, and no recoding is carried out, nor is partial matching employed at the conflation stage. Tarry justified this approach by the desirability of eliminating the large amount of manual preprocessing required, both in the construction of the suffix lists, and in the formulation of the suffix removal rules. The additional advantage of this procedure might be in automatically determining subject-specific or language-specific lists of suffixes. Tarry's algorithm was compared with the INSPEC algorithm (Tarry, 1978), and the difference in performance between the two algorithms was found to be quite small.

2.3.6 Hafer and Weiss

One of the conflation algorithms which also does away with a substantial amount of manual preprocessing, for example the drawing up of the suffix lists, was developed by Hafer and Weiss (1974). This algorithm is based on segmenting lexical text into stems and affixes. The stemming technique employed uses the concept of successor and predecessor varieties, which are the numbers of distinct letters succeeding and preceding a given character string in a text corpus. The motivation for using these quantities is the fact that within a word, the ith letter is dependent to some degree on the i - 1 letters that precede it. Within a natural word unit (i.e., a term or affix), this dependence is quite strong and increases with increased i.
But if the ith letter begins a new word unit, the dependence is greatly reduced (e.g., -M in the word ANTIMATTER is not at all dependent on its four predecessors). Within word units the successor variety is low and tends to decrease from left to right, while at boundaries the successor variety rises. By calculating the set of successor varieties for a test word and noting the peaks, the word units can be detected, and therefore both suffixes and prefixes can be removed from words.

Hafer and Weiss (1974) tested a total of 15 segmentation strategies, ranging from simply segmenting a word whenever a successor variety exceeded some predetermined limit, to using entropy methods, in which each successor letter was weighted by its probability of occurrence. The strategy which performed best in comparison with manual affix removal required either both the successor and predecessor varieties to exceed a threshold value, or the successor variety to be negative.

It is claimed by Hafer and Weiss (1974) that segmentation by this method achieves accuracy at least sufficient for the purposes of word conflation, and that the retrieval results obtained with various test collections are identical to those obtained with algorithms incorporating more manual processing. Their additional argument is that this method allows the text corpus to determine the segmentation points, making it more adaptable to changes in a collection, or to a new collection. However, in the evaluation experiments by Lennon et al. (1981), the Hafer-Weiss method performed worse than the other tested algorithms. It was also found that this algorithm required a substantial amount of processing to determine the predecessor and successor varieties, since the entire dictionary, and its reversed form, must be inspected for segmentation to take place.

2.3.7 SMART

The SMART system (Salton, 1971) uses an enhanced version of the Lovins stemmer that removes many different suffixes.
The stemming algorithm implemented in the SMART system operates in the following way. First, the longest possible suffix is found that allows the remaining stem to be of length 2 or greater. The resulting word stem is then checked against an exception list for the given suffix, and, if passed, is processed into the final stem in a cleanup step. This recoding step uses a set of rules to produce the proper word ending, such as removing a double consonant. The algorithm uses auxiliary files containing a list of over 260 possible suffixes, a large exception list, and the recoding rules.

2.3.8 MORPHS

The retrieval system MORPHS, i.e., Minicomputer Operated Retrieval (Partially Heuristic) System, which incorporates automatic stemming for compact storage, document retrieval and automatic role indicating, is described by Bell and Jones (1976). Automatic stemming is based both on the standardization of word forms (mainly, plural forms are substituted by the singular) and on the use of so-called role indicators. The automatic role indication relies on the fact that the affixes of a word can provide information about the function of that word. In MORPHS, when an affix is stripped from a word, it is replaced by the role indicator associated with that affix. This forms a potential for searching either roots or derived forms. Thus, it is possible to search MIX; or MIX (role A) - implying MIXING; or MIX (role D) - implying MIXED.

Automatic stemming in MORPHS is based on the use of an extensive suffix list. The length of this list is due in part to the number of exceptions incorporated (thus, CATION and STATION are protected from the -ION stripping routine), and in part to the presence of chemical suffixes. The system attempts to guard against the removal of apparent suffix strings by a minimum stem length and by checking that the stem is present in the stem dictionary before the affix is removed.
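Guards of this kind, combining a protected-word exception list, a minimum stem length and a stem-dictionary check, might be sketched as follows; all names and the example data are illustrative, not drawn from the MORPHS specification:

```python
def guarded_strip(word, suffix, stems, exceptions, min_stem=3):
    """Strip `suffix` from `word` only if the word is not a
    protected exception, the remaining stem is long enough, and
    the stem is present in the stem dictionary."""
    if word in exceptions or not word.endswith(suffix):
        return word
    stem = word[:-len(suffix)]
    if len(stem) >= min_stem and stem in stems:
        return stem
    return word
```

With exceptions = {"CATION", "STATION"} and a stem dictionary containing MISS, the word STATION keeps its -ION while MISSION is reduced to MISS.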
Both longest-match and iterative methods are used in the removal of suffixes and the creation of stem dictionaries. For example, the word PREVULCANIZATION is stripped first to PREVULCANIS (A), then to PREVULCAN (A) and finally to VULCAN (DA); the letters within parentheses indicate the role indicator.

2.3.9 Cercone

Some conflation algorithms have been developed and tested for natural language applications. Since the design of such algorithms is primarily concerned with the meanings and functions of words, there is a need for a higher degree of linguistic correctness, which usually results in a complicated stemming process. One of these algorithms, known as the morphological algorithm, was developed by Cercone (1978).

Cercone's algorithm aims to determine the root of a term by removing suffixes and prefixes, using the principle of iteration and consulting an affix dictionary. This process uses a system of order classes, assuming that affixes, and particularly suffixes, are attached to stems in a certain order. The removal of some affixes is followed by recoding of the root. After recoding, the root dictionary is searched and, if a match is made, the root and the affix are output. If there is no match, the next order class of affixes is accessed until a root is eventually found. Thus, the algorithm requires a root dictionary containing all possible root forms, various affix lists, and the construction of recoding rules. All of these were drawn up manually. It is claimed by Cercone (1978) that such a morphological analyzer can significantly aid the identification of the function and meanings of words in a given corpus of text.

2.3.10 MARS

Niedermair et al. (1985) have developed a system called MARS which aims to facilitate the user's access to all searchable terms in a database which are morphologically related to the given search term. Linguistic knowledge and word decomposition procedures are at the heart of this system.
The operation of MARS, after the automatic elimination of stop-words, is based on the use of a morpheme dictionary and a morpheme grammar, which are employed to achieve the morphological decomposition of words, i.e., to split words into prefix, stem, derivational and inflectional elements. The extracted word stems are collected in a stem-file in which pointers back to the text words containing the particular stem can be followed, enabling retrieval of these words.

The morpheme dictionary contains affixes, inflectional endings, and fillers. They are all represented in a uniform way, that is, the string itself and a 32-bit string indicating special morpheme characteristics and certain compositional properties. The morphemes in the dictionary are the longest possible strings obtainable from all of their possible derivations (TRADITIONALLY, for instance, would be viewed as a derivation of TRADITION and not of TRAD(E)). Two smaller lists are added to this dictionary; one includes "irregular" stems like Latin and Greek plurals and irregular verb forms, the other contains strings which regularly undergo grammatical change, such as -Y to -IE.

Before the actual decomposition starts, a pre-processor checks to see if string transformations are necessary. After this, the three lists mentioned above are used by a decomposition grammar which deals with each word. After having reached a certain state in the word, say prefix, certain conditions have to be fulfilled if the word is to be passed to the next stage. These rules are coded in a morpheme grammar.

The retrieval performance of MARS was tested on a sample of twelve real searches. It is reported by Niedermair et al. (1985) that recall was increased by 68% compared to the total number of documents retrieved without MARS; this was achieved without a significant decrease in precision, which dropped only by 7%.
Retrieval tests also revealed that MARS performed less effectively when searching compound words, phrases and verbs.

2.3.11 Porter

Porter's iterative stemming algorithm was developed at the University of Cambridge Computer Laboratory (Porter, 1980). The following starting points guided the construction of this algorithm: the algorithm was developed to improve information retrieval performance; its design was not thought to be a linguistic exercise; a certain error rate was expected. The algorithm uses an explicit list of endings, and, with each suffix, the criterion under which it may be removed.

The most important part of Porter's algorithm is the concept of the "measure" of a word, which guards against the removal of suffixes when the stem is too short. This measure describes the length and number of consonant-vowel-consonant strings present, a concept first studied by Dolby and Resnikoff (1964) to establish certain regularities in the structure of written English words. Since the employment of this concept greatly contributes to the simplicity and efficiency of Porter's algorithm, it is described below in detail.

According to Dolby and Resnikoff (1964), a word is defined as a lexical item represented by a sequence of letters of the alphabet. Porter (1980) adopted their idea of two main sets of letters, i.e., a set of vowels (A, E, I, O, U, and Y preceded by a consonant), and a set of consonants. A consonant can be indicated by c, a vowel by v. A list ccc... of length greater than 0 can be denoted by C, and a list vvv... of length greater than 0 can be denoted by V. Any word, or part of a word, therefore has one of the four forms:

CVCV...C
CVCV...V
VCVC...C
VCVC...V

These may all be represented by the single form

[C]VCVC...[V]

where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated m times, this may again be written as

[C](VC)^m[V]

where m, the number of repetitions of VC, is the measure of the word or word part.
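The computation of the measure just defined can be sketched in Python; this is a minimal illustration only, since Porter's published algorithm embeds the measure inside its rule conditions rather than computing it as a separate function:

```python
def measure(stem):
    """Porter's measure m: the number of VC sequences in the
    [C](VC)^m[V] form of the word.  Y counts as a vowel when
    it is preceded by a consonant."""
    def is_vowel(i):
        ch = stem[i]
        if ch in "aeiou":
            return True
        return ch == "y" and i > 0 and not is_vowel(i - 1)

    # Map each letter to 'v' or 'c', then count VC transitions.
    pattern = "".join("v" if is_vowel(i) else "c" for i in range(len(stem)))
    m = 0
    i = 0
    while i < len(pattern) and pattern[i] == "c":   # optional leading [C]
        i += 1
    while i < len(pattern):
        while i < len(pattern) and pattern[i] == "v":   # a V run
            i += 1
        if i < len(pattern):                            # a C run follows: one VC
            m += 1
            while i < len(pattern) and pattern[i] == "c":
                i += 1
    return m
```

For instance, measure("tree") is 0 and measure("troubles") is 2, in agreement with Porter's own examples.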
The case m = 0 covers the null word. Here are some examples, taken from Porter (1980):

m = 0: TR, EE, TREE, Y, BY.
m = 1: TROUBLE, OATS, TREES, IVY.
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY.

The measure, m, is therefore used to help decide whether or not it is wise to remove a suffix. For example, -ATE is removed from DERIVATE (m is greater than 1), but not from RELATE.

Porter's algorithm is a five-step, iterative procedure, using a dictionary of about 60 suffixes. Step 1 deals with plurals and past participles; the subsequent steps are much more straightforward. The algorithm has only a few context-sensitive and recoding rules, and so is economical in computing time and in storage. Despite its simplicity, retrieval tests (Porter, 1980) showed that Porter's algorithm performed slightly better than the much more complicated procedure described by Dawson (1974).

The advantages of Porter's algorithm include its simplicity and efficiency of processing, its detailed description for easy implementation in any high-level programming language, and its good performance in retrieval tests by Lennon et al. (1981). It has thus been implemented in several experimental retrieval systems, including CATALOG (Frakes, 1984), INSTRUCT (Hendry et al., 1986a,b), and OKAPI (Walker and Jones, 1987). The way in which Porter's algorithm was incorporated in the OKAPI system (Walker and Jones, 1987) is of particular interest and it is therefore described below.

2.3.12 OKAPI

OKAPI (Walker and Jones, 1987) is a system which was developed on the basis of the results of on-line catalogue research at the Polytechnic of Central London. While the first version of OKAPI (OKAPI'84) was mainly concerned with the design of modules which could allow end-users to perform on-line searches, more recent work (OKAPI'86) has concentrated on the improvement of subject retrieval.
Consequently, the following three devices were incorporated into OKAPI: automatic stemming, automatic cross-referencing, and semi-automatic spelling correction. OKAPI is reported to be the first on-line catalogue, accessing a general collection, whose performance is based on automatic stemming.

Since the use of uninhibited stemming in on-line catalogues can lead to a good deal of noise, as stated by Walker and Jones (1987), the idea of a multi-level conflation system was adopted in OKAPI. The actual stemming procedure used was that of Porter (1980), split into two levels, i.e., weak and strong stemming. In Stage 1 (weak stemming), regular English plurals and -ED and -ING endings are first removed, and then most double consonant endings reduced to single. In addition, no endings are removed from words under four letters long or from "words" which contain digits or other non-alphabetic characters. Specifically, Step 1 of the original Porter algorithm is carried out, followed by a spelling standardization, which is mainly an attempt to cope with the differences between British and American spelling. In Stage 2 (strong stemming), the endings given in Steps 2 to 5 of Porter's algorithm are removed. However, to avoid noise in performance, strong stems are given lower weights than weak stems, and those records are displayed after records retrieved with weak stems. An additional interesting feature of this procedure is that stems are never actually displayed to the user, since words often look strange when they have been stemmed.

Results of evaluation tests on the OKAPI system (Walker and Jones, 1987), based on a set of 255 searches, revealed that weak stemming is entirely beneficial in subject retrieval in on-line public access catalogues. A significantly higher proportion of relevant documents was retrieved than without stemming. On the other hand, strong stemming was not always found to be useful, although it behaved well more often than it behaved badly.
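The ranking effect of such a two-level scheme can be illustrated with a small sketch; the index layout, the stemming functions and the 0.5 weight for strong stems are my own illustrative assumptions, not values taken from the OKAPI reports:

```python
def two_level_search(query_terms, index, weak_stem, strong_stem):
    """Score records so that weak-stem matches outrank records
    matched only through the more aggressive strong stems.
    `index` maps a stem to the set of record ids containing it."""
    scores = {}
    for term in query_terms:
        for stem, weight in ((weak_stem(term), 1.0),
                             (strong_stem(term), 0.5)):
            for record in index.get(stem, ()):
                scores[record] = scores.get(record, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)
```

A record matched by both levels accumulates both weights, so it is displayed before a record reached only via strong stemming.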
Walker and Jones (1987) therefore suggested that strong stemming should not be used alone in a general catalogue; and when used with weak stemming, strong stems must be given lower weights than the corresponding weak stems. Since OKAPI is at present a unique representative of an on-line catalogue which uses automatic stemming when accessing a general collection, the results of the experimental work by Walker and Jones (1987) can be very useful in improving access to any academic or public library database, as well as to some specialized bibliographic databases in which the content of documents is represented only by title and descriptors, and not by abstracts.

2.3.13 CITE

Another example of an on-line catalogue which uses a stemming procedure, but which is designed specifically for medical terminology, is CITE (Ulmschneider and Doszkocs, 1983). This catalogue enables access to the monograph collection at the National Library of Medicine in Bethesda, Maryland, where it is used by medical researchers and students. Its main features are as follows: it accepts queries in ordinary medical language, uses automatic stemming and synonym generation via MeSH (i.e., Medical Subject Headings), assigns weights to stems and headings which determine their relative importance, outputs records in ranked order, and also allows relevance feedback.

The stemming procedure in CITE shares many of the characteristics of the approaches to term conflation described above. For example, it employs a suffix dictionary along with application rules and is intended for English language text. However, a special additional emphasis is placed on the stemming of medical English and on its dependency on the MeSH structure. A suffix dictionary was designed on the basis of an analysis of the terminal character strings of all unique terms in MEDLINE. To obtain this dictionary, all MEDLINE terms were sorted by their terminating characters, and then compared to find matching strings.
Each unique terminal string was then listed in reverse order along with its frequency of occurrence. The relative frequency of the terminal string itself, as well as the frequency of strings either containing the candidate string or contained in the candidate string, determined the construction of the final suffix list. In addition, many exceptional cases were also included in the suffix dictionary.

The stemming algorithm, which is iterative in operation, consists of the identification of the word stem and the automatic selection of "well-formed" morphological word variants from the actual inverted file entries. These term groups are further enriched by controlled vocabulary indexing terms from NLM's Medical Subject Headings (MeSH), which also includes forms of synonyms.

Although CITE provides many advanced features in word conflation, there is no published data on its actual use or effectiveness (Walker and Jones, 1987). In general, as described below, there is an evident need for more comprehensive, particularly comparative and quantitative, evaluations of stemming algorithms for information retrieval to be carried out in the future.

2.4 Evaluation of conflation algorithms for information retrieval

The retrieval performance of stemming algorithms can be evaluated using test results at different levels of comparison. The most common levels are the following: a comparison between full word retrieval and retrieval using automatic stemming, a comparison between right-hand truncation and automatic word conflation, and a comparison between different stemming algorithms. The next three sections form a survey of some of the evaluation studies which have been published.

2.4.1 Automatic stemming vs. full word retrieval

Some of the retrieval tests (e.g., Lennon et al., 1981; Niedermair et al., 1985; Walker and Jones, 1987) have shown that conflation algorithms perform significantly better in information retrieval than the use of unstemmed words.
However, there is evidence, reported by Harman (1987), that the use of stemming algorithms does not necessarily result in improvements in information retrieval. Using three general purpose stemming algorithms (Porter, Lovins, and the "S" algorithm) on three different collections, her tests revealed no substantial difference between full word retrieval and retrieval using suffixing. Although individual queries were affected by stemming, the number of queries with improved performance tended to equal the number with poorer performance, thereby resulting in little overall change for the entire test collection. Additionally, stemming caused a significant increase in query processing time. Despite these results, she concluded that

"... the stemming of query terms is intuitive to many users, is more convenient than specifically using truncation and wildcard characters in queries, and is often necessary for helping queries retrieve relevant documents in the top ten or thirty documents" (p. 106).

2.4.2 Automatic stemming vs. right-hand truncation

Since term conflation is normally achieved in conventional on-line systems using right-hand truncation as specified by the searcher, the obvious question is whether word conflation can be automated in retrieval systems with no average loss of performance. The most comprehensive experimental study so far addressing this question was carried out by Frakes (1984), who found no significant difference between these two conflation procedures. He concluded that term conflation can be automated in a retrieval system, thus allowing easier end-user access to the system.
2.4.3 Evaluation of different conflation algorithms

Since automatic stemming can potentially increase retrieval effectiveness, the evaluation of stemming algorithms must consider the following points: the efficiency of operation (i.e., the number of conditional and recoding rules, the amount of processing time required); the ease of implementation (i.e., a detailed description of an algorithm); and the amount of manual involvement in the development of an algorithm. Some of these points, as described in the above sections, were analyzed by Porter (1980) and Lennon et al. (1981). However, the most important part of the evaluation of stemming algorithms relates to their main functions, i.e., to their effectiveness in:

• decreasing the size of dictionaries, and
• increasing retrieval performance.

The comparison of different stemming algorithms, considering both aspects of effectiveness, has so far been carried out only in the retrieval tests by Lennon et al. (1981). In addition, the use of different strengths of stemming has recently been studied by Keen (1991b). There have also been some other studies reported, but they have been mainly interested in the comparison of only two algorithms, for example Porter's comparison of his and Dawson's algorithm (Porter, 1980). The results of the retrieval tests by Lennon et al. (1981) are briefly described below.

Lennon et al. (1981) tested six conflation algorithms (i.e., INSPEC, Lovins, RADCOL, Porter, Hafer-Weiss, and the so-called "frequency algorithm") on several databases. Their starting point for evaluation was the notion that errors in stemming algorithms can be of two types: either a word can be understemmed, in which case too little of the word is removed, or it can be overstemmed, when the converse applies.
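One simple indicator of where an algorithm sits between these two failure modes is the dictionary compression it achieves, the measure used by Lennon et al. (1981); a minimal sketch, in which the function name and the toy stemmer are illustrative:

```python
def dictionary_compression(words, stemmer):
    """Percentage reduction in the number of distinct dictionary
    entries after stemming; higher compression generally signals
    a stronger, more overstemming-prone algorithm."""
    distinct = set(words)
    stems = {stemmer(word) for word in distinct}
    return 100.0 * (len(distinct) - len(stems)) / len(distinct)
```

For example, a stemmer that conflates CONNECT, CONNECTED, CONNECTING and CONNECTION to a single entry compresses that four-word dictionary by 75%.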
Consequently, understemming leads, on the one hand, to the omission of relevant material, and therefore to lower recall, while overstemming, on the other hand, causes the retrieval of irrelevant documents, and therefore leads to lower precision. The amount of dictionary compression can be used as an indicator of both understemming and overstemming. The compression results achieved in the tests by Lennon et al. (1981) confirmed the correlation between overstemming and dictionary compression. For example, the RADCOL algorithm achieved the greatest compression (49.1%) but it tended to overstem, while Porter's algorithm achieved the least compression, but tended to understem.

Consequently, since stemming is mainly a recall-oriented device, it was expected that strong algorithms would tend to increase retrieval effectiveness more than weak ones, especially in a recall-oriented search. However, the retrieval tests demonstrated that there is no relationship between the strength of an algorithm and the consequent retrieval effectiveness arising from its use. For example, Porter's algorithm tended to understem, but it performed better than the RADCOL algorithm, which tended to overstem. The INSPEC algorithm, on the other hand, is also a strong algorithm, but it gave the best precision-oriented search. Experiments by Keen (1991b) using different strengths of stemming did not confirm average improvements in performance of any magnitude either.

2.5 Conclusions

Many nearest neighbour document retrieval systems have been described in the research literature and operational implementations of some of these ideas are now available (Willett, 1988a). To date, the great bulk of this work has been carried out with English language material, where stop-word lists and stemming routines have been available for many years.
In order to be able to provide end-user access to databases of Slovene text, using nearest neighbour searching techniques, there is a need for an appropriate stop-word list and stemming algorithm. The only work which has been carried out in this area to date is the M.Sc. thesis of Dimec (1988), which reports a computer analysis of the frequency and growth characteristics of the Slovene language as a basis for the development of an automatic indexing system for Slovene medical literature. Two stop-word lists were created in this study; in addition, a simple stemming algorithm was designed, based on the use of a list of 381 suffixes. However, Dimec (1988) notes that there are many limitations with the procedure as described, these resulting in large part from the small number of suffixes used and from the very complex morphology of the Slovene language. Thus, a Slovene language free-text retrieval system, based on the use of nearest neighbour searching techniques, demands the following:

• the creation of a general purpose stop-word list, which is not restricted to medical literature;
• the design of a more powerful stemming algorithm that takes greater account of the language's morphological structure.

In order to be able to accomplish these two objectives, a detailed knowledge of the main characteristics of the Slovene language is required.

Chapter 3
Main Characteristics of the Slovene Language

3.1 Introduction

Slovene is spoken by some two million people living mainly in the western part of Yugoslavia, i.e., in the Republic of Slovenia, one of the six federal units comprising that country. In Italy (Trieste, Gorizia, Julian Venetia and Retia) and in Austria (Carinthia) it is spoken by Slovene minorities, and elsewhere, especially in America, by Slovene emigrants.
Linguistically, Slovene is a South Slavic language with a speech area to the west of Serbocroatian, wedging into the Italian, German, and Hungarian linguistic territories in the extreme eastern spurs of the Alps.

One of the main characteristics of contemporary Slovene is a profound discrepancy between its written and spoken forms. Although the written Slovene language was created in the 16th century and intended for religious use at first, it slowly expanded its communicative function, via a wide range of dialects which differed from district to district. Indeed, it has retained an unusual degree of vigour and distinctiveness up to the present day. Therefore, the contemporary Slovene literary language, prescribed for educated speech, described in its grammar books and used in literature, scholarship and in the communication media, represents only its standard form and has been labelled a "Schriftsprache", a "book language"; in other words, an artificial language (Lencek, 1982). In the literature, it is usually stressed that there is no such thing as standard spoken Slovene; that the gap between the natural language or dialects, and the written language is almost unbridgeable. This gap consists of such features as stress placement and tense/lax vowel quality, which are not reflected in the writing system (Bidwell, 1969); thus, the written form gives an impression of unity and consistency not actually present in the natural language. It is in this sense that the contemporary Slovene language is indeed more a book-language than any other Slavic language (Lencek, 1982).

However, literary Slovene represents the only common standard which unites the speakers of Slovene. This point is very important for a wide range of different socioeconomic, political, educational, and cultural activities in Slovenia, one of which is the design and development of information retrieval systems.
Deriving from the need to design automatic word conflation procedures for the Slovene language, it is therefore the main aim of this chapter to give a concise description of the main characteristics of the contemporary Slovene language, particularly its inflectional morphology. On this basis, the essential requirements for the design of a stemming algorithm for Slovene will be defined.

This chapter consists of two main sections. In the first section, the Slovene alphabet and pronunciation are concisely outlined. The second section is much larger, and contains an analysis of the morphological structure of the Slovene language. First, the concept of word-formation is briefly summarized, followed by a detailed description of Slovene inflectional morphology. Particular emphasis is given to the explanation of the basic grammatical categories, word (formal) classes and morphemic alternations occurring in both stems and suffixes during inflection. In the conclusion, the main points to be considered in the design of the Slovene stemming algorithm are outlined. Whenever possible, a comparison between Slovene and English grammar is also made, intended mainly to make the complexity of Slovene morphology clearer to the English reader.

The literature which was consulted during the work on this chapter is listed in Appendix A. However, it is necessary to emphasize that three sources in particular were most helpful, i.e., Lencek (1982) and Toporišič (1975; 1984). In addition, many of the examples which serve in this chapter to illustrate the structure of the Slovene language were taken from Toporišič (1975). It is perhaps needless to say that the sources listed in Appendix A represent a far more exhaustive and comprehensive analysis of the Slovene language than the description presented in this chapter. Therefore, readers who are deeply interested in the structure of Slovene are strongly advised to use the literature included in Appendix A.
3.2 The Slovene alphabet and pronunciation

Contemporary Slovene has twenty-five letters. Their order is as follows: a, b, c, č, d, e, f, g, h, i, j, k, l, m, n, o, p, r, s, š, t, u, v, z, ž. In foreign words the letters q, w, x, y may also appear; in the alphabet, q stands between p and r, and w, x, y between v and z. As in the English alphabet, all letters can be divided into two main groups, i.e., vowels and consonants.

3.2.1 Vowels

The Slovene alphabet has five vowels: a, e, i, o, u. The pronunciation of the vowels in Slovene is quite complicated since there are three accent marks to be observed (the acute indicates long close vowels, the grave indicates short open vowels, and the circumflex indicates long open vowels), yet they are not placed over vowels in the written language. For example, e can be pronounced in the following ways:

• e (verjeti - to believe); a sound similar to the first part of the English word aim;
• e (vse - everything); a sound similar to the first part of the English word everything;
• e (Vera - Vera, a name); a sound similar to the middle part of the English word man.

In addition, e can also be pronounced as the unstressed e (for example, dedek - grandfather, where the second e is reduced, like the sound at the beginning of the English word about). Because of the discrepancy between vowels in the written form and their vocalic sounds, it is not surprising that a common view expressed by foreign speakers is that Slovene pronunciation is "impossible" to learn (Tollefson, 1981).

3.2.2 Consonants

There are twenty consonants in the Slovene alphabet: b, c, č, d, f, g, h, j, k, l, m, n, p, r, s, š, t, v, z, ž. The Slovene consonants are mostly pronounced as they are spelled.
The consonants which differ from English either in spelling or pronunciation are given below:

• c is pronounced as tz, as in the English word tzar;
• č is pronounced as ch, as in the English word church;
• g is pronounced as g, as in the English word gun;
• h is pronounced as the German ch (Dach) and not as the English h (he);
• j is pronounced as y, as in the English word yet;
• l preceding a vowel is pronounced as a middle or European l: šola - German Schule. The pronunciation of the final l is similar to the pronunciation of the English w;
• s is pronounced as s, as in the English word sit;
• š is pronounced as sh, as in the English word show;
• v has three sounds:
— it is a true v (English v) when preceding a vowel;
— it is pronounced as a w before a consonant or if it is the final letter of a word;
— sometimes it is a true u (English u), especially when both preceded and followed by consonants;
• z is pronounced as z, as in the English word zero;
• ž is pronounced as the s in the English word measure.

For the purpose of a description of the morphemic alternations caused by Slovene inflectional morphology, it is perhaps useful to define a further, functional classification of consonants. They can be divided into the following groups:

• SONORANTS: v, m, n, r, l, j
• OBSTRUENTS:
— voiced: b, d, g, z, ž
— voiceless: p, t, k, s, š, c, č, f, h

As will be outlined in the following sections, the position of sonorants or obstruents at the end of the stem plays an important role in the changes of both stems and suffixes during the inflection of words.

3.3 Morphological structure of the Slovene language

3.3.1 The concept of word formation

Before saying anything about the morphological structure of the Slovene language, it is necessary to introduce briefly—in order to understand the main prerequisites for the design of automatic word conflation—the concept of word formation in Slovene.
Using a highly simplified notion (for a comprehensive analysis of the theory of word formation in the Slovene language see Vidovič-Muha, 1988), it can be stressed that the formation of words in Slovene does not differ much from that in other languages where new word forms are created from a stem with the addition of derivational suffixes. Although many distinct words can be created from one stem in this way, they still usually have a similar meaning, as for example:

Slovene        English
RAZISKAVA      RESEARCH
RAZISKOVALEC   RESEARCHER
RAZISKOVATI    RESEARCH

This feature of the Slovene language is extremely important because it indicates the use of right-hand truncation as the best way to achieve word conflation. However, Slovene is characterized by a wider range of derivational suffixes than is English (for a comprehensive list of derivational endings see Toporišič, 1984). It is Slovene inflectional morphology which formally distinguishes the two languages. This part of the structure of the Slovene language will be described in the following sections. Such an approach will be justified at the end of this chapter, where all possible variants of the above stem RAZISKOVA- will be listed.

3.3.2 Inflectional morphology of Slovene

In order to describe the morphological complexity of the Slovene language, it is necessary to introduce three main concepts: word (formal) classes, inflection, and grammatical categories. The concept of basic word classes is at the heart of Slovene morphology. According to Toporišič (1984), the Slovene language is characterized by the following nine word classes:

1. substantive words:
(a) noun (hiša, otrok, tla; house, child, floor)
(b) verbal noun (dejanje, skrb, petje; action, care, singing)
(c) substantival adjective word (dežurna; on duty)
(d) substantive pronoun (jaz, ti, on, kdo; I, you, he, who)
2.
adjective words:
(a) adjective (lep -a -o; pretty)
(b) numeral (en -a -o, prvi -a -o; one, first)
(c) adjective pronoun (tak -a -o, moj -a -e; such, my)
3. verb:
(a) personal forms (dela -m -te; I'm working, you're working)
(b) descriptive participle ending in -l, and -n/-t (delal, delan, ubit; worked, murdered);
(c) non-inflectional forms (delaje, delajoč, delati, delat; to work)
4. adverb (doma, včeraj, zakaj; at home, yesterday, why)
5. predicate (všeč, treba, tiho; wished for, it is necessary, quiet)
6. preposition (do, za, brez; to, for, without)
7. conjunction (in, toda, če; and, but, if)
8. copula (samo, tudi, pač; only, also, indeed)
9. interjection (ej; ah).

The main feature of the above-listed word (formal) classes is their division into inflectional and non-inflectional categories. While substantive words, adjective words and the verb constitute the former group, the adverb, predicate, preposition, conjunction, copula, and interjection share the characteristics of the latter group. It is important to emphasize at this stage that the inflection of the members of the Slovene word (formal) classes is carried out by the application of different endings, known also as inflectional suffixes. Consequently, taking the direction towards the design of automatic right-hand truncation—usually based on a list of endings—as the best way to achieve word conflation can again be supported. It is, therefore, mainly the inflectional group of words which will be described in the following sections. This decision can also be justified by the fact that most of the non-inflectional words (e.g., prepositions, conjunctions) belong to the so-called non-content bearing words and can, therefore, be considered as candidates for a list of stop-words in an information retrieval system. To outline and illustrate Slovene inflectional morphology, the concept of grammatical categories has to be introduced.
By this concept, the general formal and semantic properties which bring together words of different concrete-lexical meanings into the same form-class are described (Lencek, 1982). The grammatical categories are inherent, as gender in substantives and aspect in verbs, or syntactically determined, as gender and number in adjectives. Together with their word (formal) classes, they make up the paradigmatic system of Slovene morphology. In general, Slovene shares its grammatical categories with the other Slavic languages. It differs from them in that it possesses the category of dual in addition to singular and plural, and in that its nominal system does not possess a special morphological form to express an appeal (vocative). According to Lencek (1982), the basic grammatical categories of the Slovene nominal "parts of speech" are: gender, number, case, the animate/inanimate distinction in substantives, and the definitive/indefinitive and positive/nonpositive oppositions in adjectives. The basic grammatical categories of the Slovene verbal forms are: aspect, voice, person, number (and marginally gender), tense, and mood. The presence or absence of some of these categories in an inflectional form makes the inflectional forms of Slovene conjugation finite and nonfinite; a finite form is marked by the category of person (and sometimes gender), while a nonfinite form does not express person. In addition, the Slovene inflectional forms are either simple (e.g., the present tense, imperative) or compound (e.g., the past, future). A "match" of these grammatical categories with their word (formal) classes is shown in Table 3.1, which illustrates the inflectional complexity of the Slovene language.

[Table 3.1: Word (formal) classes and grammatical categories. The table marks, for each word class (noun, verbal noun, substantive pronoun, adjective, numeral, adjective pronoun, verb (personal form), descriptive participle, adverb, predicate, preposition, conjunction, copula, interjection), which of the categories gender, number, case, person, degree and aspect apply to it.]

A "match" of the grammatical categories and word (formal) classes results in different inflectional patterns. For example, declension is a feature of all substantive and adjective word classes, since in Slovene the relationship between words in sentences is expressed by the application of six cases. Bearing in mind the design of a stemming algorithm for the Slovene language, it is important at this point to emphasize again that all inflected word forms in Slovene consist of two main parts: a stem and a suffix. While a stem can be defined as the content-bearing part of the inflected word, a suffix represents those units of the inflected word which mark its gender, number, case, person, etc. The following are some examples:

PERSON: dela -m -š ... (working);
CASE: lip -a -e -i ... (lime-tree);
GENDER: lep - -a -o ... (pretty);
NUMBER: lep - -a -i ... (pretty).

It is evident that the employment of suffixes plays a major role in the inflectional morphology of the Slovene language. To explain how and to what extent this affects the development of a Slovene stemming algorithm, illustrations of the morphological structure of the Slovene language will be given in the subsections below. The framework of this description will be based on the grammatical categories, starting with the category of gender.

3.3.3 The category of gender

The Slovene language distinguishes three genders: the masculine (on - he), the feminine (ona - she) and the neuter (ono - it). It is interesting to note that in the majority of the Slavic languages, gender is inherent in substantives, inflected in adjectives, and not expressed in pronouns (Lencek, 1982). Slovene, however, has extended gender to personal pronouns, and marginally to verbal inflection.
The following are some examples of the application of gender in Slovene, across nouns, adjectives, verbs (participles) and pronouns:

Masculine: brat (brother), stol (chair), med (honey); lep, slovenski, mlad; delal, zaželen, pil; moj; on, mi
Feminine: sestra (sister), miza (table), knjiga (book); lepa, slovenska, mlada; delala, zaželena, pila; moja; ona, me
Neuter: dete (baby), mesto (town), sonce (sun); lepo, slovensko, mlado; delalo, zaželeno, pilo; moje; ono, me
(Glosses: pretty, Slovene, young; worked, desired, drunk; my; he, she, it; we.)

As can be seen from the examples above—although there are some departures from the rules—words ending in the nominative case singular in -a are mostly feminine (miza), those ending in -o and -e are mostly neuter (dete, mesto), and those ending in a consonant, -i or -u are mostly masculine (brat, slovenski). The following two sentences illustrate the use of gender in Slovene and compare it to English:

Tvoj novi prijatelj je raziskovalec. (Your new friend is a researcher.)
Tvoja nova prijateljica je raziskovalka. (Your new friend is a researcher.)

3.3.4 The category of number

The morphology of Slovene is unusual because, besides the singular and plural, the dual is also used when referring to two persons or objects. The following are some examples of this:

Singular / Dual / Plural
ena miza (one table) / dve mizi (two tables) / tri mize (three tables)
eno mesto (a town) / dve mesti (two towns) / tri mesta (three towns)
lep (pretty - M) / lepa (pretty - M) / lepi (pretty - M)
lepa (pretty - F) / lepi (pretty - F) / lepe (pretty - F)
on (he) / onadva (they two) / oni (they)

As illustrated above, the category of number is applied not only to nouns, but also to adjectives and pronouns. This feature distinguishes Slovene sharply from English, since the latter is characterized by singular and plural only, and the most common suffixes in the plural are either -s or -es.
3.3.5 The category of case

However, one of the most striking differences between Slovene and English morphology is the fact that in Slovene the inflection denotes not only the number form, but also the relationship of individual words in the sentence, which in English is expressed by the use of prepositions. These forms are called cases. There are six of them in the Slovene language: Nominative, Genitive, Dative, Accusative, Locative, Instrumental. The category of case is relevant to the following word (formal) classes: nouns, verbal nouns, adjectives, and pronouns. The examples listed below illustrate the consequences of the application of the category of case—a phenomenon known as declension—mainly in terms of the increased number of new suffixes. The declension of nouns mainly follows four patterns; the endings below are given in the case order Nominative, Genitive, Dative, Accusative, Locative, Instrumental, with a dash marking a zero ending:

Singular:
1. miz-a -e -i -o -i -o
2. nit- -i -i - -i -jo
3. korak- -a -u - -u -om
4. mest-o -a -u -o -u -om

Dual:
1. miz-i - -ama -i -ah -ama
2. nit-i -i -ima -i -ih -ima
3. korak-a -ov -oma -a -ih -oma
4. mest-i - -oma -i -ih -oma

Plural:
1. miz-e - -am -e -ah -ami
2. nit-i -i -im -i -ih -mi
3. korak-i -ov -om -e -ih -i
4. mest-a - -om -a -ih -i

The following sentences, using only the singular of the word mesto (a town), illustrate the use of cases in Slovene:

1. To je mesto. (This is a town.)
2. Ne vidim nobenega mesta. (I can't see any town.)
3. Približujem se mestu. (I'm walking towards this town.)
4. Kako bi opisal to mesto? (How would you describe this town?)
5. Kdo živi v tem mestu? (Who lives in this town?)
6. Pod tem mestom teče reka. (There is a river beneath the town.)
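The practical consequence of these declension patterns for retrieval can be made concrete with a short sketch, which is an illustration of my own rather than part of the thesis software: expanding a first-pattern noun such as miza from its stem and a table of endings shows that every resulting form shares the invariant left-hand part miz-, which is exactly what right-hand truncation exploits. The ending lists and names below are assumptions of this example.

```python
# Hypothetical sketch: expanding the first declension pattern (miz-a) from a
# table of case endings, in the case order nominative, genitive, dative,
# accusative, locative, instrumental. An empty string marks a zero ending.
MIZA_ENDINGS = {
    "singular": ["a", "e", "i", "o", "i", "o"],
    "dual":     ["i", "", "ama", "i", "ah", "ama"],
    "plural":   ["e", "", "am", "e", "ah", "ami"],
}

def paradigm(stem, endings):
    """Attach every number/case ending to the stem."""
    return {number: [stem + e for e in sufs] for number, sufs in endings.items()}

forms = paradigm("miz", MIZA_ENDINGS)
print(forms["singular"])  # ['miza', 'mize', 'mizi', 'mizo', 'mizi', 'mizo']

# Every one of the eighteen slots begins with the same invariant stem:
assert all(f.startswith("miz") for number in forms for f in forms[number])
```

Since the stem is the only content-bearing, invariant part, stripping the endings on the right conflates the whole paradigm to miz-; the same holds, with further endings, for the other three patterns.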
The second group of word (formal) classes undergoing declension are the adjectives; they are declined as follows (again in the case order Nominative, Genitive, Dative, Accusative, Locative, Instrumental, with a dash marking a zero ending):

Singular:
M: lep- -ega -emu -/-ega -em -im
F: lep-a -e -i -o -i -o
N: lep-o -ega -emu -o -em -im

Dual:
M: lep-a -ih -ima -a -ih -ima
F: lep-i -ih -ima -i -ih -ima
N: lep-i -ih -ima -i -ih -ima

Plural:
M: lep-i -ih -im -e -ih -imi
F: lep-e -ih -im -e -ih -imi
N: lep-a -ih -im -a -ih -imi

The remaining two categories, i.e., pronouns and numerals, are declined in the same way as adjectives. Since they, and in particular the pronouns, can be defined as function words and are therefore considered as candidates for a stop-word list, their inflectional characteristics indicate that a list of stop-words in Slovene will be more comprehensive than one created for English. To illustrate this point, the following is an example of how the interrogative pronoun kaj (what) can be declined: kaj, česa, čemu, kaj, čem, čim.

3.3.6 The category of degree

The gradation of adjectives and adverbs in Slovene can be defined by a process that is formally quite similar to that in English. As in the English language, there are three degrees of comparison in Slovene: the positive (the non-compared fundamental form), the comparative and the superlative. In addition, both languages are characterized by two forms of comparison:

• the comparative is formed by means of the adverb bolj (more), and the superlative by najbolj (the most), placed before the adjective or adverb: bel (white), bolj bel, najbolj bel;
• the comparative is formed by the suffixes -ejši, -ši, and -ji (in English -er), the superlative by adding the prefix naj- to the comparative (in English by the article the and the suffix -est, e.g., the biggest).
The following are some examples of this form of comparison:

star (old), starejši (older), najstarejši (the oldest)
dolg (long), daljši (longer), najdaljši (the longest)
nizek (low), nižji (lower), najnižji (the lowest)

However, it is necessary to stress that the gradation of adjectives and adverbs in Slovene belongs to the word-formation concept. Thus, terms created during gradation can be defined as new word forms and are characterized by gender, number, case, etc. For example, starejši can be declined like any other adjective.

3.3.7 Grammatical categories of the verbal forms

The large amount of literature devoted to the Slovene verbal system (for comprehensive descriptions see Paternost, 1963; Lencek, 1966; Toporišič, 1984) indicates that the verb is at the heart of both the word-formation theory and the morphology of the Slovene language. Since it is not the intention of this chapter to outline the Slovene language in detail, the description of the main verbal categories has only one simple aim, i.e., to illustrate how the employment of different verbal forms can significantly increase the number of suffixes and therefore influence the design of the stemming algorithm for the Slovene language. According to Toporišič (1975), the principal verbal forms in Slovene are the infinitive (ending in -ti or -či) and the present tense (ending in the 1st person singular in -m). How other verbal forms derive from them can again be shown using the word delati (to work):

1. infinitive: delati;
2. supine: delat;
3. participle ending in -l: delal;
4. participle ending in -n: delan;
5. verbal noun: delanje;
6. present tense: delam;
7. imperative: delaj;
8. participle ending in -č: delajoč.

The main grammatical categories of the verbal forms are person, tense, aspect and mood. The examples below illustrate how their employment affects the behaviour of the verb.

The category of person

Verbal forms are related to the three types of the category of person (1st, 2nd, 3rd person).
The following is an example of the conjugation of the verb delati (to work), giving the 1st, 2nd and 3rd person endings:

Singular: dela -m -š -
Dual: dela -va -ta -ta
Plural: dela -mo -te -jo

The category of tense

There are four tenses in Slovene. Except for the present tense, they are all formed with the participle ending in -l (or in -n or -t) and the auxiliary:

Present tense: delam (I work);
Past tense: delal sem (I worked);
Future tense: delal bom (I shall work);
Pluperfect tense: delal sem bil (I had worked).

The category of mood

There are three moods in Slovene:

Indicative: delam (I work), delal sem (I worked);
Imperative: delaj (work), delajmo (let us work);
Conditional: delal bi (I would work), delal bi bil (I would have worked).

The category of aspect

Every verb obligatorily belongs to one of two classes of aspect: perfective or imperfective. Thus, the majority of Slovene verbs occur in two formal varieties, one of which implies that the action is understood as limited (a perfective verb, e.g., to reach), and the other that the action is understood as unlimited (an imperfective verb, e.g., to be reaching). The contrast between them is expressed not only by different suffixes, but also by a radical alternation of the stem; the latter is of particular concern to the design of the automatic word conflation algorithm for the Slovene language. The following are some examples of perfective verbs, followed by imperfective verbs; they illustrate both the appearance of new suffixes and alternations within stems:

dvigniti - dvigati (to lift - to be lifting)
prenesti - prenašati (to transfer - to be transferring)
seči - segati (to reach - to be reaching)
priti - prihajati (to come - to be coming)

3.4 Types of morphemic alternations

As pointed out in the previous sections, the characteristics of the inflectional morphology of Slovene, together with the concept of word-formation, constitute a starting-point for the design of the Slovene stemming algorithm.
However, there is one additional feature of the Slovene language which has to be considered in the development of any word conflation procedure. This feature corresponds to the frequent alternations occurring in both stems and suffixes during the inflection of word forms. Whilst the rich inflectional morphology of the Slovene language indicates a need mainly for developing an extensive list of suffixes, the process of alternation not only causes new endings to be added to the suffix list, but also requires the introduction of context-sensitive and recoding rules as a part of the automatic word conflation procedure. Although some types of alternation have already been noted in the examples illustrating the category of aspect (prenesti - prenašati) and the category of degree (nizek - nižji), the main aim of this section is to describe and illustrate concisely the basic types of modification. Two main sources have been consulted in the preparation of this section, the first being Lencek (1982), and the second, Toporišič (1984). According to Lencek (1982), there are three basic alternation types—prosodic, vocalic, and consonantal—common to both the nominal and verbal systems in Slovene. Since prosodic alternations mainly involve alternations of stress, they are not relevant to the written form of the Slovene language. Thus, the next subsections will be concerned with vocalic and consonantal modifications, occurring in both suffixes and stems.

3.4.1 Vocalic alternations

There are two types of vocalic alternation which are potentially interesting for the design of a stemming algorithm for Slovene:

• the vowel ~ zero alternations;
• the grave ~ acute vowel alternations of the o ~ e type.

In the first type of alternation, there are four vowels, i.e., e, i, o, a, which alternate with zero in the nominal system of the Slovene language. A zero can be defined as a consonantal cluster, particularly of a consonant + sonorant type, occurring at the end of a stem.
The following are some examples of the e ~ zero alternation occurring during the inflection of words:

veter - vetra (wind)
dinozaver - dinozavra (dinosaur)
bolezen - bolezni (illness)
miren - mirnega (peaceful)
sestra - sester (sister)
brati - berem (to read)

As far as longer word forms are concerned (for example, dinozaver - dinozavra), this type of modification should not represent a serious problem in the design of the stemming algorithm; a deletion of the suffixes -er or -ra can simply be employed, producing the stem dinozav-. However, a removal of these two endings from shorter terms, for example, veter - vetra, would result in a stem vet. Consequently, this can cause a serious overstemming problem (consider a term vet-). On the other hand, leaving both terms untouched would result in serious understemming. It is obvious that these examples indicate a need for introducing recoding rules as an inevitable part of the Slovene stemming algorithm. The second type of alternation (i.e., the o ~ e type) can be described as follows: if the letter before a suffix beginning with -o is one of the consonants c, č, ž, š, j, then the grave vowel -o is automatically changed to the acute vowel -e. The following are two examples:

fantom (instrumental case of boy) vs. stricem (instrumental case of uncle)
dekletom (instrumental case of girl) vs. mladeničem (instrumental case of youngster)

As far as the design of the Slovene stemming algorithm is concerned, the main effect of this type of modification is again an increased number of endings becoming candidates for the suffix list.

3.4.2 Consonantal alternations

The two types of consonantal alternation in the Slovene language are known as substitutive softenings of Type I and Type II (Lencek, 1982).
The substitutive softenings of Type I, known also as K ~ Č, are:

k ~ č, g ~ ž, h ~ š, c ~ č, s ~ š, z ~ ž, t ~ č, d ~ j,
sk ~ šč, st ~ šč, zd ~ ž, sl ~ šlj,
r ~ rj, l ~ lj, n ~ nj, p ~ plj, b ~ blj, v ~ vlj, m ~ mlj

It has to be stressed that the K ~ Č types of alternation are mainly found in verbal inflections; in addition, they also occur in the formation of comparatives. The following are some examples:

jokati - jočem; visok - višji; zgubiti - zgubljen
strgati - stržem; tanek - tanjši; prelomiti - prelomljen
prenesti - prenašati; daleč - dalje; prenoviti - prenovljen

The substitutive softenings of Type II, known also as K ~ C, have only two instances:

k ~ c, g ~ z

These alternations are also mainly a characteristic of the verbal inflection, particularly the imperative mood, e.g.:

rek (stem of say): reci, recite
leg (stem of lay): lezi, ležite

In general, consonantal alternations are very frequent and can thus introduce additional requirements into the design of the stemming algorithm.

3.4.3 Truncation

In addition to vocalic and consonantal alternations, the Slovene verbal inflection is also characterized by truncation, which consists of the modification of a basic verbal stem (Lencek, 1982). Truncation very often involves deeper changes in the stem. Thus, the vocalic stems ending in -ov-a-, when truncated, change their -ov- to -uj-. For example:

raziskov-a-: raziskuj, raziskujem, raziskujeta
darov-a-: daruj, darujem, darujejo

Apart from the examples of the alternations outlined above, there are many more cases of morphemic modification occurring in the Slovene inflectional system. They are concisely described in a highly systematic way by Toporišič (1984).
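The alternations described in this section call for two kinds of rule in a conflation procedure, which can be sketched briefly. The sketch below is an invented illustration under assumptions of my own, not the thesis algorithm: a recoding rule with a minimum-stem-length check handles the vocalic e ~ zero pair veter/vetra, and a small normalization map undoes a subset of the substitutive softenings (a simplification, since a softened consonant such as č may derive from k, t or c).

```python
# Illustrative sketch only; the rules, names and threshold are assumptions.

# 1. Recoding plus a minimum-stem-length check for the e ~ zero alternation:
#    -er is rewritten to -r instead of being removed outright, so that
#    veter/vetra conflate without overstemming to vet-.
MIN_STEM = 3  # hypothetical minimum number of characters left in a stem

def conflate(word):
    if word.endswith("er") and len(word) - 1 >= MIN_STEM:
        return word[:-2] + "r"   # recoding rule: veter -> vetr, not vet
    if word.endswith("a") and len(word) - 1 >= MIN_STEM:
        return word[:-1]         # plain suffix removal: vetra -> vetr
    return word

# 2. Normalization of a few softenings: a softened stem-final consonant is
#    mapped back to one canonical partner (ambiguities are ignored here).
SOFTENED = {"č": "k", "ž": "g", "š": "h", "c": "k"}

def normalise(stem):
    if stem and stem[-1] in SOFTENED:
        return stem[:-1] + SOFTENED[stem[-1]]
    return stem

assert conflate("veter") == conflate("vetra") == "vetr"
assert conflate("dinozaver") == conflate("dinozavra") == "dinozavr"
assert normalise("joč") == normalise("jok") == "jok"   # jokati / jočem
assert normalise("rec") == "rek"                       # rek: reci, recite
```

A full procedure would need many more such rules, and context-sensitive conditions to decide when each one may fire; the point of the sketch is only that both mechanisms are required by the alternations listed above.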
3.4.4 Complexity of Slovene morphology, using an example

On the basis of the discussion above about the morphological structure of the Slovene language, two main points can be emphasized:

• the Slovene language displays the features of an extremely rich inflectional morphology in both the verbal and nominal systems;
• in addition, Slovene is characterized by various types of morphemic alternation, occurring in both stems and suffixes during inflection.

As described at the beginning of this chapter, we can now return to the stem RAZISKOVA- (research) and give a list of all its variants to illustrate the above two points:

RAZISKAVA RAZISKAVE RAZISKAVI RAZISKAVO RAZISKAV RAZISKAVAMA RAZISKAVAH RAZISKAVAM RAZISKAVAMI
RAZISKOVALEC RAZISKOVALCA RAZISKOVALCU RAZISKOVALCEM RAZISKOVALCEV RAZISKOVALCEMA RAZISKOVALCIH RAZISKOVALCI RAZISKOVALCE
RAZISKOVALKA RAZISKOVALKE RAZISKOVALKI RAZISKOVALKO RAZISKOVALK RAZISKOVALKAMA RAZISKOVALKAH RAZISKOVALKAM RAZISKOVALKAMI
RAZISKOVANJE RAZISKOVANJA RAZISKOVANJU RAZISKOVANJEM RAZISKOVANJI RAZISKOVANJ RAZISKOVANJEMA RAZISKOVANJIH RAZISKOVANJI
RAZISKOVALNI RAZISKOVALNEGA RAZISKOVALNEMU RAZISKOVALNEM RAZISKOVALNIM RAZISKOVALNA RAZISKOVALNIH RAZISKOVALNIMA RAZISKOVALNIMI RAZISKOVALNE RAZISKOVALNO
RAZISKOVATI RAZISKATI RAZISKAT RAZISKAL RAZISKALA RAZISKALI
RAZISKAN RAZISKANEGA RAZISKANEMU RAZISKANEM RAZISKANIM RAZISKANA RAZISKANIH RAZISKANIMA RAZISKANIMI RAZISKANE RAZISKANI RAZISKANO
RAZISKUJEM RAZISKUJEŠ RAZISKUJE RAZISKUJEVA RAZISKUJETA RAZISKUJEMO RAZISKUJEJO
RAZISKUJ RAZISKUJVA RAZISKUJTA RAZISKUJMO RAZISKUJTE RAZIŠČI RAZIŠČIVA RAZIŠČITE RAZIŠČIMO RAZIŠČEJO
RAZISKUJOČ RAZISKUJOČEGA RAZISKUJOČEMU RAZISKUJOČEM RAZISKUJOČIM RAZISKUJOČA RAZISKUJOČIH RAZISKUJOČIMA RAZISKUJOČE RAZISKUJOČIMI RAZISKUJOČI RAZISKUJOČO

In addition to the 94 variants of the stem RAZISKOVA- identified by the author, the following are some other examples, showing only a stem and the number of its variants. These examples are taken from the experimental text corpus—as described in the next chapter—and, therefore, do not illustrate all possible forms of the particular stem.

Stem      Number of variants
RAZVOJ    43
UPORAB    41
INFOR     35
SPECIAL   26
SISTEM    25
STROK     24

3.5 Conclusions

Having listed a total of 94 variants for the stem RAZISKOVA- (research), and having thus illustrated, using also some other words, the complexity of Slovene morphology in both its suffix variations and morphemic alternations, the following conclusions concerning further work on the stemming algorithm for the Slovene language can be drawn:

• Right-hand truncation, if properly designed, can play an enormously important role in improving both the effectiveness and efficiency of Slovene text retrieval systems;
• Morphemic alternations, which very often cause deeper changes in a stem and are a frequent phenomenon in Slovene inflectional morphology, particularly in its verbal system, impose serious limitations on the idea of manual right-hand truncation;
• Bearing in mind the other disadvantages of manual right-hand truncation, as described in previous chapters, the decision to design an automatic conflation procedure seems to be the most appropriate solution for the Slovene IR environment;
• Familiarity with the characteristics of the Slovene language indicates that the design of an automatic conflation procedure will involve the following factors:
1. A list of stop-words will comprise a large number of terms.
2.
The best way of achieving automatic word conflation seems to be by developing a stemming algorithm based on the longest-match principle; trying to establish iteration patterns seems to be an extremely difficult and almost impossible task.
3. To implement the longest-match principle, a list of suffixes is needed. There are indications that the list of endings will share one of the characteristics of the stop-word list, i.e., that of being very comprehensive.
4. The stemming algorithm will necessarily require context-sensitive and recoding rules; the latter will be particularly important to avoid overstemming in word forms having five or fewer characters.
5. However, the main aim of the design process will be to obtain a reasonable balance between, on the one hand, the number of rules and, on the other hand, simplicity and efficiency of processing.

Chapter 4

Development of a Stemming Algorithm for the Slovene Language

4.1 Introduction

4.1.1 Information retrieval research in Slovenia

When thinking about the development and design of a stemming algorithm for the Slovene language, considering in particular its morphological complexity, an interesting but also contradictory situation comes to light. While, on the one hand, a certain degree of progress has been achieved in natural language processing research over the last ten years, information retrieval research, on the other hand, has played a very small part in developing modern, non-conventional techniques to improve the effectiveness and efficiency of retrieval systems. Research in natural language processing has mainly been carried out at the Institut Jožef Štefan in Ljubljana. The main aim of this research has been to develop natural language understanding concepts (Tancig, 1985). On the basis of a detailed analysis of the syntax of the Slovene language, and using artificial intelligence methods, primary interest has been focused on developing semantic schemes for the Slovene language.
Although some of the research projects have contributed towards development in this area, there has been a significant lack of application of the research results, caused particularly by the characteristics of Slovene morphology. In addition, there has so far been no link with the information retrieval research community for the potential transfer of research results to improve access to existing bibliographic databases. From the 1970s onwards, a number of databases have been created in specialized information centres and libraries in Slovenia. At present, according to the report published by the Research Community of Slovenia (1989), there are 49 databases, 27 of which are bibliographic. It is interesting to note that the contents of the bibliographic databases are represented only by descriptors, and not by abstracts. With regard to programs, a number of different information retrieval systems are used, the most popular being the TRIP system, developed by PARALOG (1990) and employed on a VAX/VMS mainframe at the University of Ljubljana, and the ATLASS system, developed by the Institute for Information Science, University of Maribor (1990), also employed on a VAX/VMS mainframe. However, all these systems share the characteristics of conventional retrieval systems, i.e., Boolean searching techniques are employed and professional intermediaries are needed to carry out on-line searches on behalf of end-users. Furthermore, the effectiveness and efficiency of these systems have rarely been evaluated. Consequently, modern, non-conventional methods and techniques of information retrieval, for example, automatic indexing, best-match searching, and term weighting, are neither incorporated into existing retrieval systems in Slovenia, nor has any research—with one exception described below—been carried out in this area.
However, there is no doubt that, with developments in software technology, with the growing number of bibliographic and other types of databases in Slovenia, and with increasing user demand for accurate and up-to-date information, the area of modern information retrieval research will become very attractive as an alternative to conventional, Boolean information retrieval. Therefore, the implementation of a Slovene language-based free-text retrieval system, and particularly the design of a stemming algorithm, which is the main scope of this PhD research project, together with the results of the experimental research work carried out by Dimec (1988) as described below, serves as an important starting-point for the development of a new generation of statistically-based free-text retrieval systems for Slovene textual databases.

4.1.2 Computer analysis of the Slovene language in medicine

A computer analysis of the Slovene language in medicine (Dimec, 1988) has, so far, been the only published research report in Slovenia using statistically-based techniques for information retrieval. The main aim of the project was to find out the frequency and growth characteristics of Slovene free language in medicine for the potential employment of automatic indexing in medical retrieval systems. The research experiments were carried out on a text corpus which consisted of Slovene medical articles taken from scientific journals. More than 30,000 words were included in this corpus. Apart from testing a term discrimination model (as described by Salton et al., 1975) and a two-Poisson distribution scheme (as described by Harter, 1975) as potential models for automatic term selection, a considerable amount of work was devoted to the compilation of both a stop-word list and a suffix list, particularly as a means of achieving dictionary compression in information retrieval.
On the basis of the frequency distribution of words from the text corpus, and of a consideration of the characteristics of medical terminology, two extensive lists of stop-words were created. The first stop-word list (1,205 words) primarily included function words and number terms. The second stop-word list (2,866 words) consisted of terms which carried meaning but were thought not to be relevant to medical information retrieval. It is important to note that the second list included not only speciality words such as BOLEZEN (DISEASE), BOLNIK (PATIENT), MEDICINA (MEDICINE), but also other terms such as CENTIMETER (CENTIMETER), ČAS (TIME), DEFINICIJA (DEFINITION), NOVEMBER (NOVEMBER), GRAM (GRAMME), MESTO (TOWN), NAČRT (PLAN), OPIS (DESCRIPTION), etc. It is, therefore, not surprising that the employment of both stop-word lists resulted in a 72% compression of the existing text corpus. This level of compression is extremely high since, for example, van Rijsbergen (1979) reports about 30%-50% compression where similar procedures were applied to an English language text. It is obvious that the main reason for this high compression can be found in the second stop-word list, which included many words of potentially low value only for medicine, but not necessarily for the Slovene language in general. Unfortunately, there is no evidence in the research report by Dimec (1988) as to how the design of the second stop-word list, which goes drastically beyond a core of function words, would be balanced against the effectiveness of the medical information retrieval system if such a dictionary were employed in the medical database.
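The compression figures quoted above are simple token-level ratios. As a minimal illustration (the helper name and toy data below are hypothetical, not taken from Dimec's study), such a percentage can be computed as:

```python
def stoplist_compression(tokens, stop_words):
    """Fraction of word tokens removed by a stop-word list."""
    stop = set(stop_words)
    removed = sum(1 for t in tokens if t in stop)
    return removed / len(tokens)

# Toy token stream: six function-word tokens, two content-bearing terms.
tokens = ["JE", "IN", "BOLEZEN", "JE", "V", "SE", "CELICA", "IN"]
ratio = stoplist_compression(tokens, ["JE", "IN", "V", "SE"])
print(f"{ratio:.0%} of the tokens removed")  # 6 of 8 tokens are stop-words
```

On a real corpus the same calculation would be applied to the full token stream, giving figures such as the 72% reported by Dimec (1988) or the 30%-50% reported by van Rijsbergen (1979).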
The results of experiments in developing stop-word lists for the English language (see, for example, Jones and Bell, 1984) have shown that once a stop-word list includes more than just function words (i.e., prepositions, articles, pronouns, etc.), the selection of stop-words becomes a very subjective process, since the potential list appears to be virtually boundless. Consequently, the effectiveness of the information retrieval system can be seriously threatened. Thus, the majority of stop-word lists developed so far for the English language include only function words, words from phrases and, sometimes, speciality words. A simple conflation procedure was also developed as part of the research project by Dimec (1988) to avoid morphological variants of terms in the text corpus and to achieve additional compression. The conflation procedure uses a list of 381 suffixes which were generated from the reversed, alphabetically sorted list of words from the text corpus. The longest-match method was employed in word conflation, each suffix also having an associated rule for minimum stem length. Despite the complexity of Slovene morphology there are no recoding rules and, as noted by Dimec (1988), many improvements are necessary to achieve better word conflation; for example, words such as JETRA - JETER (nominative and genitive case of LIVER), or CELICA - CELIČNA (noun and adjective of CELL) did not match, although they should have matched. Both words are characterized by morphemic alternation of the e ~ zero and c ~ č type, as described in detail in Chapter 3. Since there is no description of any evaluation of this stemming algorithm, apart from testing the level of compression, Table 4.1 shows an example of the conflation results when the procedure was applied to some variants of the familiar term RAZISKAVA.
Word (input)     Stem (output)      Word (input)       Stem (output)
RAZISKAN         RAZISK             RAZISKOVALNE       RAZISKOV
RAZISKAV         RAZISKAV           RAZISKOVALNEGA     RAZISKOV
RAZISKAVA        RAZISK             RAZISKOVALNEM      RAZISKOV
RAZISKAVAH       RAZISKAV           RAZISKOVALNEMU     RAZISKOV
RAZISKAVE        RAZISK             RAZISKOVALNI       RAZISKOV
RAZISKAVI        RAZISK             RAZISKOVALNIH      RAZISKOV
RAZISKAVO        RAZISKAV           RAZISKOVALNIM      RAZISKOV
RAZISKOVALCA     RAZISKOVALC        RAZISKOVALNO       RAZISKOV
RAZISKOVALCE     RAZISKOVAL         RAZISKOVANJ        RAZISKOVANJ
RAZISKOVALCEM    RAZISKOVALC        RAZISKOVANJA       RAZISKOV
RAZISKOVALCEV    RAZISKOVAL         RAZISKOVANJE       RAZISKOV
RAZISKOVALCI     RAZISKOVAL         RAZISKOVANJEM      RAZISKOVAN
RAZISKOVALCU     RAZISKOVALC        RAZISKOVANJU       RAZISKOV
RAZISKOVALEC     RAZISKOVAL         RAZISKUJE          RAZISK
RAZISKOVALNA     RAZISKOV           RAZISKUJEJO        RAZISKUJ

Table 4.1: Results of conflation of variants of the word RAZISKAVA, using the algorithm designed by Dimec (1988)

As can be seen from Table 4.1, many enhancements in the algorithm are needed to improve the results of word conflation. Reducing the total of 30 variants of the stem RAZISK- to 9 different forms, and obtaining only 5 terms (16.6%) with the common stem RAZISK-, indicates a need for further work on the stemming algorithm. In addition, the poor performance results correlate with the small compression achieved in this text corpus. Dimec (1988) reports about a 12% reduction of the text corpus that remained after application of the stop-word lists. This is a very low figure when compared to the results reported by Lennon et al. (1981). These authors evaluated various stemming algorithms on different test collections in the English language, and the level of compression obtained ranged from 26.2% to 50.5%. The main reason for the low figure obtained by Dimec (1988) arises from the relatively small list of suffixes, which is not in accordance with the complexity of the morphological structure of the Slovene language.
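The longest-match, minimum-stem-length procedure described above can be sketched as follows. The suffix list and length thresholds here are illustrative assumptions only, not Dimec's original 381-entry list; the sketch also reproduces the absence of recoding rules, so pairs such as JETRA - JETER fail to conflate, as noted in the text.

```python
# Illustrative suffix table: suffix -> minimum stem length that must remain.
SUFFIXES = {
    "OVALCA": 4, "OVALCEV": 4, "AVAH": 4, "AVE": 4,
    "UJE": 4, "A": 3, "E": 3, "I": 3, "O": 3,
}

def stem(word):
    """Context-free, longest-match suffix stripping with a minimum-stem rule."""
    # Try candidate suffixes from longest to shortest (longest-match rule).
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= SUFFIXES[suf]:
            return word[: len(word) - len(suf)]
    return word

print(stem("RAZISKAVE"), stem("RAZISKUJE"))
# Without recoding rules the e ~ zero alternation is not handled:
print(stem("JETRA"), stem("JETER"))
```

With the toy table, RAZISKAVE and RAZISKUJE both reduce to RAZISK, but JETRA and JETER yield different stems, which is precisely the class of failure that motivates the context-sensitive rules discussed later in this chapter.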
There is no doubt that the results of this research project should be viewed as an important step towards the implementation of modern information retrieval methods and techniques in the Slovene textual database environment. However, more experimental work is needed in this area, particularly because some of the techniques so far developed and tested were based on a small text corpus and related only to medical terminology.

4.1.3 A general framework for the design of a stemming algorithm for the Slovene language

The most important objective of this project is the design of a powerful stemming algorithm that takes account of the language's morphological structure. On the basis of a review of different stemming algorithms, as described in Chapter 2, and of a consideration of the complexity of Slovene morphology, as outlined in Chapter 3, the following framework of experimental research work was designed to achieve this objective. The research project was divided into two stages. The first stage of experimental work, based mainly on the so-called frequency approach, included the automatic compilation of a suffix list, the development of a stop-word list and the design of a simple conflation procedure. Experiments were carried out on two different text collections, one consisting of terms from library and information science articles, the other covering the general area. In each database, almost 60,000 terms were included. In addition, an English text corpus, also covering the area of librarianship and information science, was used to illustrate language-dependent requirements in the design of the stemming algorithm.
The performance results of the frequency algorithm, the characteristics of the Slovene language, and some conclusions derived from work carried out by Dimec (1988) suggested an introduction to the second stage of the experimental research work, i.e., a need to develop more sophisticated methods and techniques in the stemming algorithm if the objective was to achieve better conflation results. However, the main aim of the experimental work in the second stage was to obtain a reasonable balance between, on the one hand, the number of rules and, on the other hand, the simplicity and efficiency of the conflation procedure. Since the above outlined experiments were strictly applied in the design of the stemming algorithm, the course of the research project will be described in this chapter in the following order. First, the methodology employed in the experimental work will be outlined, followed by the analysis of the frequency distribution of words in the Slovene textual databases. On this basis the two main streams of the research work will be presented. Whilst in the first part the design and evaluation of both the stop-word list and a frequency algorithm will be outlined, the second part will describe the main points in the development and evaluation of the new stemming algorithm which will be incorporated into INSTRUCT.

4.2 A methodological framework of the experimental work

As pointed out above, there were two main objectives to be achieved as a result of the experimental research work:

• the production of the stop-word list;
• the design of the stemming algorithm, based on the list of suffixes.
Although there are many different approaches to the design of automatic conflation procedures (see, for example, Dawson, 1974; Field, 1975; Tarry, 1978; Hafer and Weiss, 1974), an approach similar to those used in the RADCOL project (Lowe et al., 1973), the CITE system (Ulmschneider and Doszkocs, 1983), and in the computer analysis of the Slovene language in medicine (Dimec, 1988) was adopted as a first step in developing the stemming algorithm. Since this approach is based on the results of the frequency distribution of words and suffixes in the textual databases, it can significantly help in the selection of terms to be included in the stop-word list, and provide a list of endings to be employed in the stemming algorithm. The following are the major steps which are usually taken if the frequency-based approach is applied to words in the textual databases:

1. all words from the text corpus are extracted;
2. these words are ranked by frequency of their occurrence;
3. since the most frequent words are usually function words, they are reviewed for their inclusion in the list of stop-words;
4. a list of stop-words is created; stop-words are removed from further analysis;
5. the remaining terms are reversed and sorted into alphabetical order;
6. adjacent words in the ordered list are then compared and, whenever a match of N characters is found, strings containing 1, 2, ..., N characters are created;
7. these strings are sorted, cumulated, and the most frequent endings are either directly employed in the longest-match, context-free frequency algorithm or used as a starting-point for the manual selection of the suffix list.

There is one major advantage to the above described procedures, i.e., they can be almost completely automated and there is very little need for any manual involvement.
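The steps above can be sketched in a few lines of code. This is a minimal illustration on synthetic words; the function name and the stop-word fragment are hypothetical, and a real run would operate on the full corpus vocabulary.

```python
from collections import Counter

def candidate_suffixes(words, stop_words):
    """Frequency-based suffix generation: remove stop-words, sort the
    vocabulary by its reversed forms so that shared endings become adjacent,
    then cumulate the endings shared by neighbouring words."""
    stop = set(stop_words)
    # Steps 1-4: extract the vocabulary and remove stop-words.
    vocab = sorted({w for w in words if w not in stop})
    # Step 5: sort words by their reversed form.
    reversed_sorted = sorted(vocab, key=lambda w: w[::-1])
    endings = Counter()
    # Step 6: for adjacent words sharing N final characters,
    # generate the endings of length 1..N.
    for a, b in zip(reversed_sorted, reversed_sorted[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[-(n + 1)] == b[-(n + 1)]:
            n += 1
        for k in range(1, n + 1):
            endings[a[-k:]] += 1
    # Step 7: most frequent endings first.
    return endings.most_common()

words = ["ZBIRKA", "ZBIRKE", "ZBIRKO", "KNJIGA", "KNJIGE", "IN", "JE"]
print(candidate_suffixes(words, ["IN", "JE"]))
```

On a realistic vocabulary the resulting ranked list of endings would then be reviewed manually, or fed directly into a longest-match frequency algorithm, as step 7 describes.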
However, familiarity with the morphological structure of the Slovene language suggested the need for a certain degree of manual participation in both the development of the stop-word list and the design of the stemming algorithm. For example, since some of the word classes, in particular pronouns, belong to the inflectional category of terms despite being defined as non-content bearing words, there is no guarantee that all their forms will always appear in the upper part of the term distribution table. In addition, a large number of Slovene terms undergo different types of morphemic alternation which affect both stems and endings during inflection; it is difficult to note these changes automatically. Thus, all procedures used in developing a stop-word list and the new stemming algorithm will be a combination of both manual and automatic approaches. All research experiments were carried out on words from two Slovene text corpora. The first corpus, referred to as KNJIŽNICA (LIBRARY in English), consisted of terms from fourteen different articles on librarianship and information science. The articles covered topics such as library education in Slovenia, the use of serial publications in university and special libraries, the evaluation of library services and the use of new technology in libraries, inter alia. This corpus consisted of 59,088 word tokens. The second corpus consisted of the word tokens that comprise Ciril Kosmač's novel POMLADNI DAN (A DAY IN SPRING), published in 1953, and transformed into machine-readable form in 1981 by Primož Jakopin. This machine-readable text was chosen since the novel is widely recognized as representing the Slovene language in all its beauty. The novel can thus be expected to exhibit the full range of the language's morphological complexity.
It was assumed, therefore, that the initial list of both stop-words and suffixes would represent an important supplement to the lists produced from the words in the library and information science articles, i.e., KNJIŽNICA. The novel by Ciril Kosmač, which will be referred to as POMLADNI.DAN, comprises a total of 62,150 word tokens, a figure which is similar to the number of word tokens in KNJIŽNICA. In addition, a third test collection was also employed in the experiments, but this one consisted of English terms. The idea of using frequency-based techniques on an English text derived from a need to confront language-dependent requirements in the design process. The English text corpus, which will be referred to as ENG.TEXT, consisted of 55,460 word tokens from a doctoral dissertation in the field of librarianship and information science (Ellis, 1987). It is thus comparable in size, in terms of the number of word tokens, with both of the Slovene corpora, and is also comparable in subject matter with the KNJIŽNICA corpus.

4.3 Development of a stop-word list

4.3.1 Frequency distribution of terms

It is known from the early work on automatic indexing by Luhn (1958) that the frequency of occurrence of distinct words in natural language text has something to do with the importance of these words for purposes of content representation. Specifically, if all words were to occur randomly across the documents of a collection with equal frequencies, it would be impossible to distinguish between them using quantitative criteria. Since it has been observed that words occur in natural language text unevenly, they can be distinguished by their occurrence frequency. Or, in other words, as noted by Luhn (1958):

"The justification of measuring word significance by use-frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject." (p. 160).
In fact, it is known that when the distinct words in a body of text are arranged in decreasing order of their frequency of occurrence (i.e., most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf (1965), which is expressed in the following form:

rank × frequency = constant

That is, the frequency of a given word multiplied by the rank order of that word will be approximately equal to the frequency of another word multiplied by its rank. The law has been explained by citing a general "principle of least effort" which makes it easier for an author to repeat certain words instead of coining new and different words. The least-effort principle also accounts for the fact that the most frequent words tend to be short function words (AND, OF, BUT, THE, etc.) which are easy to use in text. Although Zipf's Law has been verified many times using text material across various subject areas and languages, it was again tested in this experimental work as a means of achieving two objectives:

• to design a stop-word list which could be used in information retrieval systems in Slovenia;
• to remove function words from both bodies of text in order to carry out automatic generation of suffixes from the remaining content-bearing words.

The results of the application of quantitative techniques to the words from both Slovene text collections, i.e., KNJIŽNICA and POMLADNI.DAN, are described below. First, some general characteristics of the frequency distribution of the Slovene words are observed, followed by the testing of Zipf's Law. These results, together with the results obtained by the comparison of the English and Slovene text, form a basis for discussion about the action to be taken in developing a list of Slovene stop-words.
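The rank-frequency relation can be demonstrated on a toy token stream. The data below are synthetic and chosen so that the frequencies fall off exactly as 1/rank; real text only approximates this, as Tables 4.6 and 4.7 later show.

```python
from collections import Counter

def zipf_constants(tokens):
    """Rank the vocabulary by frequency and return (term, rank, freq, rank*freq)."""
    counts = Counter(tokens).most_common()
    return [(term, r, f, r * f) for r, (term, f) in enumerate(counts, start=1)]

# Synthetic stream: frequencies 12, 6, 4, 3 at ranks 1, 2, 3, 4.
tokens = ["IN"] * 12 + ["JE"] * 6 + ["ZA"] * 4 + ["NA"] * 3
for term, rank, freq, c in zipf_constants(tokens):
    print(term, rank, freq, c)  # the product rank*freq is constant (12)
```

With real corpora the product drifts upwards for low-frequency words, which is the departure from Zipf's Law, in the direction of Booth's Law, discussed below.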
The occurrence characteristics of terms

When all words from both Slovene text databases were extracted and ranked by frequency of occurrence, the following results were obtained. First, Table 4.2 clearly shows that both bodies of text, despite their coverage of different subject areas, produced a similar number of word types.

Text corpus      Word tokens (abs)   (%)    Word types (abs)   (%)
KNJIŽNICA        59,088              100    11,525             19.5
POMLADNI.DAN     62,150              100    10,988             17.7

Table 4.2: A comparison of the number of word types in databases POMLADNI.DAN and KNJIŽNICA

The reason for the slightly larger number of word types in the database KNJIŽNICA might be the fact that this corpus consists of articles written by different authors and covering various specialized areas. However, both bodies of text produced a reasonable number of distinct terms to be included in further experiments. As expected, there was a large variation in the frequency of occurrence of the word types in the text databases. If all word types are classified into 10 groups according to their decreasing order, i.e., the first group consisting of the 10% of the most frequent words, the second group comprising the next 10% of the most frequent words, etc., the following table (Table 4.3) illustrates the frequency distribution of terms in the text collections KNJIŽNICA and POMLADNI.DAN.

Word group    KNJIŽNICA (abs)    (%)      POMLADNI.DAN (abs)   (%)
 1            39,749             67.3     45,463               73.1
 2             5,992             10.2      4,907                7.9
 3             3,500              6.0      2,853                4.5
 4             2,309              3.9      2,198                3.5
 5             1,778              3.1      1,234                2.0
 6             1,152              1.9      1,099                1.8
 7             1,152              1.9      1,099                1.8
 8             1,152              1.9      1,099                1.8
 9             1,152              1.9      1,099                1.8
10             1,152              1.9      1,099                1.8
Total         59,088            100.0     62,150              100.0

(Each word group comprises 10% of the word types: 1,152 terms in KNJIŽNICA and 1,099 terms in POMLADNI.DAN.)

Table 4.3: Frequency distribution of terms in KNJIŽNICA and POMLADNI.DAN, arranged in word groups in decreasing order

Table 4.3 shows that a very few word types provide a very high percentage of the observed tokens.
Thus, the most frequent 10% of the word types in KNJIŽNICA account for 67.3% of the tokens, whereas the bottom 50% account for only 9.5% of the tokens; the corresponding figures for POMLADNI.DAN are 73.1% and 9.0%. All terms in the bottom 50% are singletons, i.e., their frequency of occurrence is equal to 1. This type of frequency distribution is analogous to those observed in other languages.

The occurrence characteristics of the most frequent words in the Slovene language

Results of a detailed inspection of the group of the 10% most frequently occurring words from both Slovene bodies of text confirmed the expected results, i.e., terms at the very top of the listing were mainly function words. As an illustration, Table 4.4 displays a list of the 20 most frequently occurring words in the corpus KNJIŽNICA.

Rank   Term        Frequency (abs)   Frequency (%)
 1     IN          2,113             3.6
 2     V           1,768             3.0
 3     JE          1,270             2.1
 4     ZA            905             1.5
 5     NA            790             1.3
 6     KI            688             1.2
 7     DA            621             1.0
 8     SE            608             1.0
 9     SO            599             1.0
10     PA            544             0.9
11     TUDI          503             0.8
12     Z             440             0.7
13     S             385             0.6
14     KOT           333             0.6
15     KNJIŽNICE     331             0.6
16     O             319             0.5
17     NE            310             0.5
18     PO            292             0.5
19     ALI           290             0.5
20     PRI           271             0.5

Table 4.4: A list of the 20 most frequently occurring words in KNJIŽNICA

It can be seen from Table 4.4 that the most frequent terms in the text corpus KNJIŽNICA are members of the following word (formal) classes: conjunction (IN; AND), preposition (V, ZA, NA; IN, FOR, ON) and auxiliary verb (JE, SO; IS, ARE). Since the main function of these words is in tying other words in sentences together, they are known as function words, common words or non-content bearing words. Function words are poor discriminators and cannot possibly be used by themselves to identify document content. Consequently, they are usually included in the stop-word list. It is interesting to note that a content-bearing word, i.e., KNJIŽNICE (LIBRARIES), also appears in Table 4.4.
Its high position derives from the fact that this list was created on the basis of a body of text describing the subject area of librarianship and information science. Apart from the word KNJIŽNICE, some other terms had a very high frequency of occurrence, for example INFORMACIJE (INFORMATION), achieving a rank equal to 30, and UPORABNIKI (USERS), having a rank equal to 35. Since these terms carry a very low discrimination power in specialized databases, in this case the library database, they are usually included in the stop-word list, and are known as so-called speciality words. A detailed inspection of the most frequent terms in the database POMLADNI.DAN again confirmed the expected results. As can be seen from Table 4.5, all of the 20 most frequent terms are non-content bearing words, thus indicating very little quantitative difference between terms in the two different subject areas.

Rank   Term    Frequency (abs)   Frequency (%)
 1     JE      3,769             6.1
 2     IN      2,453             3.9
 3     SE      1,946             3.1
 4     V       1,219             2.0
 5     SEM       976             1.6
 6     DA        880             1.4
 7     PA        710             1.4
 8     NE        653             1.1
 9     NA        636             1.0
10     Z         575             0.9
11     KI        514             0.8
12     SO        503             0.8
13     BI        497             0.8
14     PO        442             0.7
15     ŠE        393             0.6
16     GA        392             0.6
17     S         382             0.6
18     NI        343             0.5
19     ZA        342             0.5
20     TAKO      334             0.5

Table 4.5: A list of the 20 most frequently occurring words in POMLADNI.DAN

In Table 4.5, the quantitative characteristics of the function words are particularly evident. The most frequent 5 terms (JE, IN, SE, V, SEM; IS, AND, WILL, IN, AM), representing only 0.04% of all distinct words in the text body, account for 16.7% of term usage; on the other hand, as is evident from Table 4.3, the lowest 70% of the individual terms explain only 14.5% of term usage. Since the frequency of term occurrence in both Slovene text collections confirmed a general "principle of least effort", it was assumed that the occurrence characteristics of both vocabularies would also correspond to the constant rank-frequency law of Zipf (1965).
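Occurrence statistics of the kind shown in Tables 4.2-4.5 can be sketched with a short routine. The function name and toy data below are illustrative only; real runs would use the full corpus token streams.

```python
from collections import Counter

def type_token_profile(tokens):
    """Proportion of word types among tokens, and the share of tokens covered
    by the most frequent 10% of the types (the first 'word group')."""
    counts = Counter(tokens)
    n_types = len(counts)
    ranked = [f for _, f in counts.most_common()]
    top = max(1, n_types // 10)  # size of the first word group
    return n_types / len(tokens), sum(ranked[:top]) / len(tokens)

# 100 tokens over 10 types; the single most frequent type covers 60 tokens.
tokens = (["T0"] * 60 + ["T1"] * 10 + ["T2"] * 8 + ["T3"] * 6 + ["T4"] * 5
          + ["T5"] * 4 + ["T6"] * 3 + ["T7"] * 2 + ["T8"] * 1 + ["T9"] * 1)
types_share, top_share = type_token_profile(tokens)
print(types_share, top_share)  # 0.1 0.6
```

Applied to the real corpora, the first value corresponds to the 19.5% and 17.7% figures of Table 4.2, and the second to the 67.3% and 73.1% figures of Table 4.3.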
Zipf's Law and the Slovene language

As already explained, Zipf's Law states that the product of the frequency of use of words and the rank order is approximately constant, i.e., F × R = C. Table 4.6 and Table 4.7 illustrate the results of the application of Zipf's Law to randomly selected words in the text collections KNJIŽNICA and POMLADNI.DAN respectively.

Term          Rank (R)   Frequency (F)   Product (R × F = C)
PA                10         544           5,440
PRI               20         271           5,420
INFORMACIJ        30         182           5,460
VSE               40         130           5,200
DEJAVNOSTI        50         109           5,450
NISO             100          59           5,900
NOVE             200          33           6,600
PROBLEMOV        300          24           7,200
DOSLEJ           400          18           7,200
ČASA             500          16           8,000
DANAŠNJEM      1,000           8           8,000
GRADNJI        2,000           4           8,000
PRAVNO         3,000           3           9,000
MODULOM        4,000           2           8,000
VARSTVENIH     5,000           2          10,000
SLOVENJ       10,000           1          10,000
ŽIVO          11,525           1          11,525

Table 4.6: Results of Zipf's Law on the words from the text collection KNJIŽNICA

[Figure 4.1: Plot of rank versus log of stem frequency (KNJIŽNICA)]

Although Zipf's Law can be more or less confirmed using the frequency distribution of words in the database KNJIŽNICA, it is interesting to note that the value of the constant (C) increases with decreasing frequency of words; this is especially evident for terms in the bottom part of Table 4.6, possessing frequencies of occurrence of 2 or less. This is in accordance with Booth's Law (Booth, 1967), which holds for words of very low frequency of occurrence and not for those of high frequency. Booth's Law derives from the fact that when the complete word frequency count is made for a text, words of high rank, that is, of low frequency, occur in such a way that many words have the same frequency. Thus, the less frequently occurring terms show considerable departure from Zipf's Law. This phenomenon may again be explained by the morphological structure of the Slovene language, which divides the original stem into a large number of variants, thus dispersing the frequency of the stem occurrence over its variants.
For example, in the text corpus KNJIŽNICA, a total of 5,235 terms out of 11,525 word types, i.e., 45.4%, occurred with a term frequency equal to 1. A similar situation also occurred in the text corpus POMLADNI.DAN, as shown in Table 4.7. Using Zipf's Law, it is possible to produce curves of the form used by Luhn (1958) to define significant words in the document collection, or, in other words, to exclude non-relevant terms. Figure 4.1 represents a plot of the logarithm of frequency of terms in the database KNJIŽNICA against their ranked order of frequency and, therefore, illustrates how such a curve characterizes the Slovene language. It has to be emphasized that, in order to maintain exactly the same approach as described by Luhn (1958), the frequency of words used in the figure in fact represents the frequency of already conflated words, using the new Slovene stemming algorithm which will be described at the end of this chapter. The course of the curve in Figure 4.1 is very similar to those produced for various bodies of text, mainly for the English language (see, for example, Luhn, 1958; van Rijsbergen, 1979; Ashford and Willett, 1988). According to Luhn (1958), two cutoffs, i.e., the first one for the high-frequency terms and the second one for the low-frequency words, can be defined in such a diagram in order to remove them from the document collection.

Term          Rank (R)   Frequency (F)   Product (R × F = C)
Z                 10         575           5,750
TAKO              20         334           6,680
MI                30         239           7,170
KO                40         196           7,840
KER               50         177           8,850
POČASI           100          68           6,800
BOMO             200          30           6,000
RADA             300          22           6,600
ČRNO             400          16           6,400
TEH              500          14           7,000
OBREKARJEV     1,000           7           7,000
ŠOLI           2,000           4           8,000
DOMAČEGA       3,000           2           6,000
SLEKLA         4,000           2           8,000
DELAVČEK       5,000           1           5,000
VIHRAL        10,000           1          10,000
ŽVIŽGANJE     10,988           1          10,988

Table 4.7: Results of Zipf's Law on the words from the text collection POMLADNI.DAN
Consequently, as concluded by Luhn (1958), since neither high- nor low-frequency terms are good content identifiers, the remaining medium-frequency words can be used to identify relevant terms. For the purpose of developing a stop-word list to be used in Slovene information retrieval systems, the notion of Luhn (1958) concerning the high-frequency terms is of particular interest. As evident from the results of the quantitative analysis of both KNJIŽNICA and POMLADNI.DAN, it is theoretically possible to create a list of Slovene stop-words using the frequency-based approach. Indeed, as described below, information about the frequency distribution of Slovene terms served as one of the starting-points in the design of the stop-word list. However, the complex morphological structure of the Slovene language means that additional manual involvement in the design process is inevitable. This notion is clarified in the section below, where the main quantitative differences between the Slovene and English language are described.

Quantitative comparison of the Slovene and English language

[Figures 4.2 and 4.3: Plots of rank versus log of term frequency for the Slovene (KNJIŽNICA) and English (ENG.TEXT) collections]

Word types from two text collections, the first one consisting of the Slovene terms (KNJIŽNICA), and the second one comprising English words (ENG.TEXT), were used as the basis for the quantitative comparison between these two languages.

Text corpus    Word tokens (abs)   (%)    Word types (abs)   (%)
KNJIŽNICA      59,088              100    11,525             19.5
ENG.TEXT       55,460              100     3,868              7.0

Table 4.8: A comparison of the number of word types in databases KNJIŽNICA and ENG.TEXT

Table 4.8 clearly demonstrates that both databases, despite having similar dictionary size and covering the same subject area, produced completely different totals of word types.
While, on the one hand, the Slovene text corpus consisted of 11,525 word types, i.e., 19.5% of the total number of term occurrences, the English database contained only 3,868 word types, i.e., only 7.0% of the total number of term occurrences. The main reason behind this striking difference is the morphology of the two languages, of which Slovene is by far the more complex, as illustrated in Chapter 3. Similarly, a plot of the logarithm of the frequency of word types against their ranked order of frequency, applied to each language separately, yields two different hyperbolic curves, as shown in Figures 4.2 and 4.3. It is interesting to note that the frequency distribution of terms in the Slovene text creates a more concave curve (see Figure 4.2) than the one produced for words in the English text (see Figure 4.3). The explanation can again be found in the inflectional morphology of the Slovene language. On the one hand, some members of word classes (prepositions, conjunctions) are not inflected in sentences, and these individual terms can therefore potentially reach a high frequency of occurrence, whereas the inflected members of the word classes (nouns, adjectives, pronouns, etc.) have, on the other hand, theoretically less chance of achieving high frequency, since the original stems are split during inflection into a large number of various terms; a potentially high frequency of a given stem is thus dispersed among these variants. Whilst, for example, the non-inflected term ALI (OR) achieves a frequency of occurrence of 290, the inflected word ZBIRKA (COLLECTION) appears in different variations achieving different frequencies: ZBIRK (79), ZBIRKE (53), ZBIRKA (16), ZBIRKO (16), ZBIRKI (10), ZBIRKAH (7), ZBIRKAMI (7), ZBIRKAM (2). Consequently, the Slovene language is characterized both by the small number of high-frequency words which account for a large percentage of term usage, and by a large number of low-frequency terms.
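The dispersion effect can be made concrete with the ZBIRKA figures quoted above: summing the variant frequencies shows the cumulative frequency that a conflated stem would attain. This is a small illustrative calculation, not part of the original analysis.

```python
# Frequencies of the variants of ZBIRKA as reported in the text above.
variant_freq = {
    "ZBIRK": 79, "ZBIRKE": 53, "ZBIRKA": 16, "ZBIRKO": 16,
    "ZBIRKI": 10, "ZBIRKAH": 7, "ZBIRKAMI": 7, "ZBIRKAM": 2,
}
# Conflation to a common stem recovers the cumulative frequency
# that inflection disperses over the surface forms.
stem_freq = sum(variant_freq.values())
print(stem_freq)  # 190
```

The conflated stem thus reaches a frequency of 190, far above any single surface form, although still below the 290 occurrences of the non-inflected term ALI.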
If all Slovene function words were included among the high-frequency terms, there would be no difficulty in deciding upon the list of stop-words for information retrieval systems. A simple frequency-based approach would be used, extracting the most frequent words from the text and transferring them to the stop-list. However, some of the function words (pronouns, auxiliary verbs, etc.) are also characterized by their inflection; thus, no guarantee can be given that all their variants will appear in the list of the most frequent words. Examples of inflectional variants of non-content bearing words that occur only once in KNJIŽNICA include KAK (ANY), KATERI (WHICH), MOJ (MY), MOJEGA (MINE), NAM (US), NEKI (CERTAIN), TVOJ (YOURS), STE (ARE). The production of a stop-word list for the Slovene language thus entails a much greater level of detailed, manual involvement than is required for the construction of a stop-word list for the English language.

4.3.2 Design of the Slovene stop-word list

In order to achieve one of the main goals, i.e., to create a general purpose stop-word list for Slovene, potential candidates were extracted from the following three sources:

• a textbook, written by Toporišič (1984), in which Slovene grammar is described in a highly detailed and structured way;
• both Slovene text collections, i.e., KNJIŽNICA and POMLADNI.DAN;
• a list of stop-words, created by Dimec (1988).

The sections below give a brief description of how these sources were used in the construction of a stop-word list.

Slovene grammar by Toporišič (1984)

A textbook on Slovene grammar (Toporišič, 1984), in particular the section on Slovene morphology, has proved to be an extremely valuable source in developing a stop-word list.
Its employment was especially beneficial for:

• defining a criterion for the selection of stop-words, because of its excellent description of the main word (formal) classes;
• selecting candidates for the stop-word list, because word (formal) classes were illustrated in the book with many examples.

On the basis of this textbook, the following word (formal) classes were selected for inclusion in the dictionary of non-content bearing words: substantive pronoun, numeral, adjective pronoun, auxiliary verb, adverb, predicate, preposition, conjunction, and copula. Since many examples were used in Toporišič (1984) to illustrate the above word (formal) classes, it appeared to be theoretically possible to create a preliminary list of stop-words. Thus, a large number of terms belonging to the above classes were simply extracted from Toporišič (1984) and included in a dictionary. If a certain term was a member of the inflectional category, then all its possible variants were produced, using, for example, declension, gradation, conjugation, etc. When all these terms were merged into one file, a total of 1,059 distinct stop-words was produced. A comparison of terms from this dictionary with terms created using the other two sources, i.e., both Slovene text collections and the stop-list developed by Dimec (1988), confirms the high quality of the description of Slovene morphology by Toporišič (1984). The decision to use Toporišič (1984) as a starting point in developing a stop-word list has, therefore, proved to be correct, mainly because:

• many new terms were discovered which did not occur in the other two sources (for example, BODIMO, BOSTA, MARSIKAKŠNEMU, MOJIMA, NAJINEGA, ČIGAR, TISTIMA, etc.);
• a theoretical background was firmly established for the further selection and evaluation of non-content bearing words from the other two sources.
A selection of stop-words from the Slovene text corpora

It has already been emphasized that the characteristics of the Slovene language enable, to a certain extent, the frequency approach to be employed in the design of the stop-word list. However, since some of the non-content bearing words belong to the inflectional category, not all candidates for the negative dictionary occur among the most frequent words. Thus, a manual review of all extracted words from the test collections KNJIŽNICA and POMLADNI.DAN was inevitable in order to create a useful dictionary of stop-words. The combination of the automatic frequency approach and the manual selection of stop-words produced the following total of candidates for inclusion in the preliminary stop-list:

• KNJIŽNICA: among 11,525 word types, 931 terms were found to share characteristics of the stop-words;
• POMLADNI.DAN: among 10,988 word types, 792 terms were considered as candidates for the stop-word list.

A stop-word list, created by Dimec (1988)

As has already been described in the introductory section, Dimec (1988) created two lists of stop-words as part of his project on the computer analysis of the Slovene language in medicine. While the first list consisted of non-content bearing words (mainly function words and number terms), the second list comprised meaningful terms which were assessed as not being relevant to the medical language. Following the main objectives of our research project, the interest in the design of the new stop-word list focused only on the first dictionary created by Dimec (1988), which consisted of 1,205 distinct terms. A detailed examination confirmed the quality of this list, and thus almost all stop-words were considered for incorporation into the final list. There was only one exception, i.e., numerals.
Since Dimec (1988) included in his list a large number of numerals without being consistent, a decision was made to consider only some basic forms of numerals which occurred frequently in both Slovene text databases. This decision prevented the stop-list from becoming too comprehensive, as it would have done if all forms of all numerals had been included.

A design of the final stop-word list

Having produced four preliminary stop-word lists, i.e., extracting non-content bearing terms from Toporišič (1984), selecting terms from the two text corpora, and considering terms from the first stop-word list created by Dimec (1988), it was possible simply by merging them into one file to construct the final stop-word list. Such an approach resulted in a list consisting of 1,593 individual terms. To justify the employment of different sources and the use of both manual and automatic involvement in the design process, it is perhaps interesting to note that a total of 623 terms not existing in the first stop-word list created by Dimec (1988) was included in the final dictionary. The final list of the Slovene stop-words is presented in machine-readable form on a floppy disk which can be found at the end of this thesis. The decision to use such a presentation was caused by the comprehensiveness of the stop-word list. The same decision was applied also to the list of suffixes.
The following is a brief description of the main criteria which were used in the selection of stop-words:

• a stop-word is defined as a non-content bearing word;
• consequently, the members of the word classes carrying low meaning were primarily considered as candidates for the negative dictionary; these terms are mainly function words, and belong to prepositions, pronouns, auxiliary verbs, conjunctions, etc.;
• in addition, a small core of other types of terms is also included in the list:
  - a limited number of numerals, i.e., the basic forms of the numerals ENA (ONE) and DVA (TWO);
  - a limited number of words occurring frequently in phrases, for example, V ZAČETKU (IN THE BEGINNING), Z VIDIKA (POINT OF VIEW), VKLJUČNO (INCLUDING), etc.;
  - a small core of verbs which also appear in phrases, for example, KAŽE (IT SHOWS), POVE (IT SAYS), SPADA (IT BELONGS), etc.;
  - some other terms carrying extremely low meaning in sentences, for example, DOLOČEN (CERTAIN), NASLEDNJI (NEXT), OSTALI (OTHERS), PREJŠNJI (PREVIOUS), etc.

Although the inclusion of other general words with fairly low, but variable, semantic content was often considered in the design process, for instance the words MOŽEN (POSSIBLE), POMENI (IT MEANS), IZREDNO (EXTRAORDINARY), POMEMBNO (IMPORTANT), etc., the goal of developing a general purpose stop-word list for the Slovene language led to the decision not to extend the stop-list beyond a core of function words, basic numerals, and words from phrases. Moreover, the final list does not include any speciality words, since the negative dictionary was not designed with any particular textual database in mind.
It should be noted that there are some words in Slovene that have exactly the same written form but differ in their meaning, i.e., they are homographs, and also in their pronunciation, owing to the difference between vowels in the written form and in their vocalic sounds; for example, MED can mean either BETWEEN or HONEY, VAS either VILLAGE or the accusative of YOU, and MORALA either MORALE or (SHE) SHOULD. No such words were included in the stop-word list, since it is impossible to distinguish between the variants without extensive semantic processing.

4.3.3 Evaluation of the new stop-word list

The evaluation of the new Slovene stop-word list was carried out on a sample of 10 abstracts (referred to as SLOV) from the articles stored in the corpus KNJIŽNICA. Bearing in mind the two main disadvantages of such a sample, i.e., a relatively small test collection (958 terms; 6,262 characters) and its library content (i.e., the appearance of more or less the same terms as used in the text corpus KNJIŽNICA, which was one of the main sources in the design of the stop-word list), the results of this evaluation should be considered as a preliminary stage towards evaluation on a much larger scale. Such an evaluation will be carried out on the basis of the incorporation of both the stop-word list and the stemming algorithm into INSTRUCT. The main aim of the preliminary stage of evaluation was to test the ability of the negative dictionary to achieve a reasonable compression in a body of text consisting of 958 words. The employment of the elimination technique, in which the words from the abstracts were compared with a stored stop-word list, resulted in a total of 598 remaining terms, or 62.4%. If the level of compression is expressed in terms of the number of removed words, then 37.6% compression was achieved. This level of reduction varied from abstract to abstract, ranging from 31.9% to 43.7%.
If the level of compression is estimated in terms of the reduced number of characters in the text corpus, then the amount of reduction of the Slovene test collection drops to 20.2%; the elimination technique meant that the total of 6,262 characters was cut down to 5,002 characters, or 79.8%. The main reason for the lower percentage of reduction can be found in the length of the most frequent stop-words, which rarely exceeds four characters. These results are comparable to the levels of compression accomplished when similar procedures were applied to English text (see, for example, van Rijsbergen, 1979). In order to prove this similarity, the English language-based equivalent of the above sample was produced (referred to as ENGL) and then processed by the English list of stop-words. This list consists of 294 terms and corresponds to the list as implemented within the INSTRUCT package. Table 4.9 shows the levels of compression that were achieved after the employment of the English negative dictionary; a comparison with the results obtained by the application of the Slovene stop-word list is also made.

Text                        SLOV     ENGL
Terms                       958      978
Non-deleted terms           598      559
Term compression (%)        37.6     42.9
Characters                  6,262    5,482
Non-deleted characters      5,002    4,272
Character compression (%)   20.2     22.1

Table 4.9: The levels of compression achieved by the application of the Slovene and English negative dictionaries

As can be seen in Table 4.9, both stop-word lists produced similar levels of compression. The reason for the slightly larger number of deleted English terms may be found in the grammar; for example, articles (A, AN, and THE) are not used in the Slovene language at all. More importantly, an inspection of the sets of words resulting from the use of the Slovene stop-word list showed that a successful level of indexing had been achieved.
For example, consider the following abstract from the KNJIŽNICA corpus, which contains ninety-four words and 609 characters:

UPORABNIKI IN ONLINE JAVNO DOSTOPNI KATALOG
Jože Kokole

Predstavljen je fenomen online javno dostopnega kataloga oziroma s kratico OPAC (po angleškem online public access catalogue) pri računalniško podprtem poslovanju knjižnic oziroma knjižničnih sistemov, njegovega nastanka, razvoja in stanja v razvitih sredinah, njegovih načel, karakteristik in pojavnih oblik. Obdelano je še: uporaba OPAC-ov prve in druge generacije v posameznih knjižnicah in v vzajemnih katalogih, odnos do online bibliografskih servisov, problemi končnih uporabnikov in uporabe OPAC-a, zahteve in pogoji za oblikovanje učinkovitega in uporabniško prijaznega iskalnega dialoga, perspektive za uvajanje OPAC katalogov pri nas.

After application of the stop-word list, and reduction to a single case, the abstract contains just sixty-four words and 500 characters (upper case denotes processed text):

UPORABNIKI ONLINE JAVNO DOSTOPNI KATALOG JOŽE KOKOLE PREDSTAVLJEN FENOMEN ONLINE JAVNO DOSTOPNEGA KATALOGA KRATICO OPAC ANGLEŠKEM ONLINE PUBLIC ACCESS CATALOGUE RAČUNALNIŠKO PODPRTEM POSLOVANJU KNJIŽNIC KNJIŽNIČNIH SISTEMOV NASTANKA RAZVOJA STANJA RAZVITIH SREDINAH NAČEL KARAKTERISTIK POJAVNIH OBLIK OBDELANO UPORABA OPAC-OV GENERACIJE KNJIŽNICAH VZAJEMNIH KATALOGIH ODNOS ONLINE BIBLIOGRAFSKIH SERVISOV PROBLEMI KONČNIH UPORABNIKOV UPORABE OPAC ZAHTEVE POGOJI OBLIKOVANJE UČINKOVITEGA UPORABNIŠKO PRIJAZNEGA ISKALNEGA DIALOGA PERSPEKTIVE UVAJANJE OPAC KATALOGOV

These first evaluation results indicate that the decision to use three different sources in the design of the stop-word list has proved to be correct. Both the level of compression achieved and the actual removal of non-content bearing terms are quite encouraging evidence of the quality of the Slovene negative dictionary.
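The elimination technique and the two compression measures used above can be sketched in a few lines; this is a minimal illustration only, assuming a tiny stand-in stop-list rather than the full 1,593-term negative dictionary:

```python
# A minimal sketch of the stop-word elimination technique described above.
# STOP_WORDS is a hypothetical fragment, not the real 1,593-term list.
STOP_WORDS = {"JE", "IN", "S", "PO", "PRI", "V"}

def eliminate_stop_words(words, stop_words):
    """Return only the words that survive comparison with the stop-word list."""
    return [w for w in words if w.upper() not in stop_words]

def compression(original, kept):
    """Term- and character-level compression percentages, as in Table 4.9."""
    term_pct = 100.0 * (len(original) - len(kept)) / len(original)
    orig_chars = sum(len(w) for w in original)
    kept_chars = sum(len(w) for w in kept)
    char_pct = 100.0 * (orig_chars - kept_chars) / orig_chars
    return term_pct, char_pct

words = "PREDSTAVLJEN JE FENOMEN ONLINE JAVNO DOSTOPNEGA KATALOGA IN OPAC".split()
kept = eliminate_stop_words(words, STOP_WORDS)
term_pct, char_pct = compression(words, kept)
```

Applied to a whole abstract, the two percentages correspond directly to the term and character compression figures reported in Table 4.9.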
Despite the fact that evaluation on a much larger scale is needed, the results above demonstrate that this list can successfully be used in Slovene information retrieval systems and, since it is domain-independent, employed with any textual database. At this point it is important to emphasize that in non-conventional retrieval systems the removal of stop-words from the textual databases is not an isolated action but is usually followed by automatic word conflation.

4.4 Design of a stemming algorithm

The complexity of Slovene morphology suggests that it would be extremely difficult to develop an effective iterative stemming algorithm. Accordingly, the use of longest-match algorithms has been studied.

4.4.1 Development of a suffix list

Although it is in theory possible to design a list of endings for stemming purposes on the basis of prior experience of the language, the availability of a large number of word usages in both text collections, i.e., KNJIŽNICA and POMLADNI.DAN, suggested another approach: to develop the suffix list using the information about suffixes contained implicitly in the word usage of the text collections. It has already been shown that both text corpora contained a large number of word types. Even the application of the new stop-word list to the terms in both collections did not significantly reduce the total number of distinct words; in the text corpus KNJIŽNICA a total of 10,711 distinct terms remained (i.e., 814 terms were removed), and in the text corpus POMLADNI.DAN a total of 10,215 words was left (i.e., 773 terms were excluded). Using the remaining distinct terms from both collections it is possible to design a method for the automatic generation of the suffix list. Usually, the low-frequency words are excluded from such a procedure, since they contain a certain number of foreign names, proper names, and misspellings (see, for example, Lowe et al., 1973).
Although such terms occurred among the low-frequency terms in the Slovene corpora, they were not removed from further analysis, because inflectional morphology produced a large number of relevant terms that also have low frequencies. Thus, all distinct terms which remained in the text collections after employment of the stop-word list were used for the automatic generation of suffixes. Automatic generation of endings was carried out separately on both text corpora. The following routines were written to produce a list of suffixes:

• a procedure to create a list of reversed words, sorted into alphabetical order;
• a procedure to compare the initial characters of adjacent words in the list;
• a procedure to merge and sort suffixes by decreasing frequency of occurrence.

Therefore, the starting-point for the automatic generation of suffixes was the production of a list of word reversals. This list showed, for each word, the word reversed, the word itself, and its frequency of occurrence; part of the resulting list is shown in Table 4.10.

Reversed Word      Word             Frequency
EJNAVOKSIZAR       RAZISKOVANJE     4
EJNAVOLED          DELOVANJE        27
EJNAVOLEDOS        SODELOVANJE      12
EJNAVOLSOP         POSLOVANJE       13
EJNAVOLSOPAZ       ZAPOSLOVANJE     5
EJNAVONEMI         IMENOVANJE       1
EJNAVORAV          VAROVANJE        10
EJNAVORAVAZ        ZAVAROVANJE      1
EJNAVORDAK         KADROVANJE       3
EJNAVOSIPDERP      PREDPISOVANJE    1
EJNAVOSIPO         OPISOVANJE       1
EJNAVOTEVSOP       POSVETOVANJE     1
EJNAVOTRČAN        NAČRTOVANJE      11
EJNAVOZEVOP        POVEZOVANJE      30
EJNAŠANBO          OBNAŠANJE        2
EJNAŠANV           VNAŠANJE         1
EJNAŠARPV          VPRAŠANJE        34

Table 4.10: Words and their reversed forms.

As can be seen in Table 4.10, the reversed words are arranged in alphabetical order; thus, words sharing a common suffix, such as NAČRTOVANJE and POSVETOVANJE, appear together in the list. On this basis, adjacent words in the ordered list were compared and, whenever a match of N characters was found, strings containing 1, 2, ..., N characters were created.
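The three routines above can be sketched compactly; this is a minimal illustration of the generation of trial suffixes, not the actual programs used in the experiments:

```python
# A minimal sketch of the trial-suffix generation described above: reverse
# and sort the vocabulary, compare adjacent entries, and for a match of N
# characters emit every trial suffix of length 1..N, counted by frequency.
from collections import Counter
from os.path import commonprefix

def generate_trial_suffixes(vocabulary):
    """Return a Counter mapping each trial suffix to its frequency."""
    reversed_words = sorted(w[::-1] for w in set(vocabulary))
    counts = Counter()
    for a, b in zip(reversed_words, reversed_words[1:]):
        n = len(commonprefix([a, b]))
        for k in range(1, n + 1):
            counts[a[:k][::-1]] += 1  # re-reverse to recover the suffix itself
    return counts

trial = generate_trial_suffixes(["NAČRTOVANJE", "POSVETOVANJE", "DELOVANJE"])
```

With this three-word vocabulary the adjacent pairs share the endings -OVANJE and -TOVANJE, so, for example, the trial suffix -OVANJE is counted twice and -TOVANJE once.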
Thus, the words NAČRTOVANJE and POSVETOVANJE would produce the strings -E, -JE, -NJE, -ANJE, -VANJE, -OVANJE, -TOVANJE. Such an approach to the automatic generation of suffixes provided a total of 47,981 endings from the words in the file KNJIŽNICA, and a total of 38,846 suffixes from the words in the file POMLADNI.DAN. The procedure of sorting these trial suffixes by decreasing frequency of occurrence produced a total of 7,273 distinct endings for the file KNJIŽNICA, and a total of 5,330 distinct endings for the file POMLADNI.DAN. These results are also presented in Table 4.11.

Text corpus      Suffixes (abs)   Suffixes (%)   Distinct suffixes (abs)   Distinct suffixes (%)
KNJIŽNICA        47,981           100            7,273                     15.2
POMLADNI.DAN     38,846           100            5,330                     13.7

Table 4.11: A quantitative comparison of suffixes created from terms in the databases POMLADNI.DAN and KNJIŽNICA

The quantitative comparison between these two lists shows very little difference between the suffixes. Both lists of distinct endings account for a similar proportion of suffix usage, i.e., 15.2% in KNJIŽNICA and 13.7% in POMLADNI.DAN respectively. Moreover, a plot of the logarithm of the frequency of suffixes against their ranked order of frequency also yields a similar curve for both lists, as presented in Figures 4.4 and 4.5. The shape of both curves is similar to the diagrams produced for the frequency distribution of Slovene terms. This could again serve as an indication that a small number of endings accounts for a large amount of suffix usage; this point is very important for the design of the suffix list. Apart from the fact that almost no quantitative difference was found between these two lists of suffixes, they were also characterized by many common qualitative features. For example, Table 4.12 displays the rank, frequency, and percentage of the 20

[Figure 4.4: Plot of rank versus log of suffix frequency (log frequency/rank suffix distribution, KNJIŽNICA)]
[Figure 4.5: Plot of rank versus log of suffix frequency (log frequency/rank suffix distribution, POMLADNI.DAN)]

most frequent endings in the file KNJIŽNICA and compares them with the endings in the file POMLADNI.DAN. This table shows that most of the high-frequency endings in KNJIŽNICA also occur frequently at the top of the list of endings created from the file POMLADNI.DAN.

         KNJIŽNICA                  POMLADNI.DAN
Suffix   Rank   Freq    %           Rank   Freq    %
A        1      1978    4.1         1      2275    5.8
I        2      1953    4.1         2      1816    4.7
E        3      1711    3.6         4      1305    3.4
O        4      1511    3.1         3      1355    3.5
H        5      787     1.6         11     381     1.0
IH       6      668     1.4         19     265     0.7
M        7      593     1.2         7      660     1.7
JE       8      510     1.1         25     210     0.5
TI       9      474     1.0         12     381     1.0
JO       10     441     0.9         26     197     0.5
NE       11     435     0.9         22     239     0.6
NO       12     431     0.9         13     354     0.9
JA       13     424     0.9         31     173     0.4
NI       14     416     0.9         23     236     0.6
U        15     382     0.8         15     326     0.8
NA       16     371     0.8         17     280     0.7
V        17     337     0.7         43     111     0.3
GA       18     321     0.7         20     242     0.6
NIH      19     319     0.7         60     79      0.2
EM       20     305     0.6         18     272     0.7

Table 4.12: Frequency distribution of the 20 most frequent endings in KNJIŽNICA and their occurrence in POMLADNI.DAN

Some interesting points can be noted in Table 4.12, one of them being the fact that -A, -E, -I, and -O are the most frequent endings in both lists, accounting for 14.9% of suffix usage in the file KNJIŽNICA and for 17.4% in the file POMLADNI.DAN respectively. Since no essential difference was found between the two lists of endings, a decision was made to join together all distinct words from both text collections into one file and again generate a list of suffixes. This decision was based on the fact that the two text collections had only 1,038 terms in common, and it was therefore hoped that a larger vocabulary would produce a more useful list of endings. Thus, all procedures for the automatic compilation of suffixes were repeated, this time on the text corpus consisting of 19,888 distinct words; this corpus will be referred to as SLOV.
Initially, these procedures created a total of 87,544 endings; when they were sorted into order and duplicates eliminated, a list of 11,815 distinct suffixes was obtained. A detailed analysis of the frequency distribution of the suffixes provided some important results. Firstly, it was found that the frequency of occurrence of trial suffixes declined very rapidly with increasing rank, as illustrated in Figure 4.6, which represents a plot of the frequency of individual suffixes in SLOV against their ranked order of frequency. Secondly, a sampling from the trial suffix list showed that lower frequency trial suffixes were obviously not candidates for adoption in the suffix list. This point can be illustrated using two tables, Table 4.13 displaying the 20 most frequent endings, and Table 4.14 presenting 20 endings from the bottom of the trial suffix list. While Table 4.13 displays a list of potentially useful endings, most of which are suffixes in the linguistic sense, any employment of suffixes from Table 4.14 would by no means contribute to successful word conflation, since they comprise almost all the characters of the given stem. An inspection of the list of suffixes after they had been sorted into order of decreasing frequency of occurrence showed that the most frequent endings did, in fact, correspond to well-known suffixes. This suggests the use of a simple context-free stemming algorithm, which is also known as a frequency algorithm.

[Figure 4.6: Plot of rank versus log of suffix frequency (log frequency/rank suffix distribution, SLOV)]
Rank   Suffix   Freq    %
1      A        4052    4.6
2      I        3531    4.0
3      E        2879    3.3
4      O        2699    3.1
5      M        1220    1.4
6      L        1139    1.3
7      H        1121    1.3
8      LA       944     1.1
9      IH       900     1.0
10     TI       790     0.9
11     NO       740     0.8
12     JE       695     0.8
13     TJ       681     0.8
14     NE       646     0.7
15     NA       632     0.7
16     NI       626     0.7
17     LI       624     0.7
18     JO       609     0.7
19     JA       564     0.6
20     EM       561     0.6

Table 4.13: A list of the 20 most frequently occurring endings created from words in the file SLOV

Rank    Suffix       Freq %
11631   ZRAVNAL      0.00114
11632   ZRAVNALA     0.00114
11633   ZREDNA       0.00114
11634   ZRL          0.00114
11635   ZRLA         0.00114
11636   ZTI          0.00114
11637   ZTRGAL       0.00114
11638   ZTRGALA      0.00114
11639   ZTRGALO      0.00114
11640   ZUME         0.00114
11641   ZUMLJIV      0.00114
11642   ZUMLJIVO     0.00114
11643   ZVAJALCEV    0.00114
11644   ZVALA        0.00114
11645   ZVALI        0.00114
11646   ZVEDB        0.00114
11647   ZVEDBE       0.00114
11648   ZVENEL       0.00114
11649   ZVENIJO      0.00114
11650   ZVIJA        0.00114

Table 4.14: An excerpt from the list of low frequency endings created from words in the file SLOV

4.4.2 Design of the frequency algorithm

Since no recoding or context-sensitive rules are required for the frequency algorithm, it was quite easy to design such a conflation procedure and use it on the words from the Slovene texts. The only restriction placed on suffix removal was that the remaining stem should be of a minimum length of three characters. This approach is clearly crude in concept, as discussed by Lennon et al. (1981), but avoids the need for the detailed manual processing that characterizes most other ways of creating lists of suffixes. The context-free algorithm operates as follows:

1. The suffix list is stored in reversed form and in alphabetical order.
2. The word to be conflated is read in, reversed, and the number of characters in the word is determined.
3. The last letter (first letter when reversed) of the word is noted and used to address that portion of the suffix list that contains suffixes commencing with this letter.
4.
This portion of the suffix list is scanned to identify the largest suffix within it that matches the query word.
5. When a suffix is found that matches the end of the word, the stem length that would be left on removal of this suffix is examined.
6. If this length is less than the required minimum, i.e., three characters, the length of the suffix is reduced by eliminating its first letter (last letter when reversed), and the suffix list is searched again for this smaller suffix.
7. If a suffix is found satisfying the minimum stem condition, then the suffix is removed, and the resulting stem is re-reversed and output; otherwise, no action is taken and the original word is output.

Evaluation of the frequency algorithm

Six sets of suffixes were generated, containing 100, 500, 1,000, 1,500, 2,000 and 3,000 suffixes, and then tested on 220 variants of the eight different word stems listed in Table 4.15. It has to be emphasized in this context that the term "stem" is not defined in the pure linguistic sense, but is specified as a string of characters to which variants of the basic stem can be reduced without altering its meaning. Thus, for example, a stem of the variants FINANCE and FINANČNA is FINAN-, since the letters following the final -N differ from each other. Similarly, the variants RAZVITOST and RAZVOJ have their first four characters in common, i.e., RAZV-, which can consequently be defined as a stem. The figures in this table show that, in general, the performance of the algorithm is far from satisfactory. The best overall results were obtained with the list containing 2,000 suffixes; even here, however, less than 40% of the words were conflated to the correct root. Particularly poor results were evident with roots having long suffixes, e.g., KNJIŽ, which has variants such as KNJIŽNIČARSTVO, KNJIŽNIČARSKEGA, etc. In addition, there was no consistent relationship between the size of the suffix set and performance.
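The seven steps above can be sketched as follows; for brevity, the reversed, alphabetically addressed suffix file of steps 1-3 is replaced by a direct set lookup, and the suffix set shown is a tiny illustrative sample rather than one of the six experimental sets:

```python
# A minimal sketch of the context-free (frequency) stemming algorithm
# described above. The longest-first loop collapses steps 3-6: candidate
# endings are tried from longest to shortest, skipping any whose removal
# would leave a stem shorter than the minimum length.
SUFFIXES = {"OVANJE", "ANJE", "NJE", "JE", "A", "E", "I", "O"}  # sample only

def context_free_stem(word, suffixes, min_stem=3):
    """Longest-match suffix removal subject to a minimum stem length."""
    for k in range(len(word) - 1, 0, -1):
        if word[-k:] in suffixes and len(word) - k >= min_stem:
            return word[:-k]       # step 7: strip the suffix, output the stem
    return word                    # no suffix qualifies: output word unchanged
```

For example, `context_free_stem("RAZISKOVANJE", SUFFIXES)` strips -OVANJE to give RAZISK, while a word such as DANJE shows step 6 in action: -ANJE and -NJE are rejected by the minimum stem condition, so the shorter suffix -JE is removed instead.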
An inspection of the stems resulting from the algorithm shows that both understemming and overstemming have occurred. For example, when the 2,000-suffix set (which gave the best overall results) is applied to the 45 variants of the root RAZV, the algorithm produces not only RAZV but also RAZ, RAZVI, RAZVIJ and RAZVO. The last three stems are all examples of understemming, while the first is an example of overstemming, since RAZ is the beginning not only of RAZVOJ (DEVELOPMENT) but also of RAZLIKA (DIFFERENCE), RAZLOG (REASON) and RAZRED (CLASS), inter alia.

                  Number of suffixes
Stem      Variants   100   500   1,000   1,500   2,000   3,000
FINAN     19         2     7     10      10      9       7
KNJIŽ     37         2     5     7       7       9       12
KNJIG     14         5     4     3       5       5       5
RAZISK    30         0     10    12      11      10      8
RAZV      43         6     8     9       13      13      15
SPECIAL   26         7     2     5       4       7       5
SPECIF    9          1     6     7       8       9       9
UPORAB    42         11    19    19      20      21      18
Total     220        34    61    72      78      83      79

Table 4.15: Performance of the context-free stemming algorithm using different numbers of suffixes. The entries in the table give the number of variants conflated to the correct stem (as denoted by the string in the left-hand column of the table).

The poor level of performance that is evident from Table 4.15 meant that the frequency algorithm is unlikely to achieve good results; in particular, there is the problem of deciding upon a threshold for suffix selection. Since the employment of the frequency algorithm on English-language bodies of text produced much better results than those described above (see, for example, Tarry, 1978; Lennon et al., 1981), an experiment was carried out to find the main quantitative differences between English and Slovene endings.

A quantitative comparison between English and Slovene suffixes

An experiment to find the main quantitative characteristics of the English and Slovene endings was carried out using the library test collections, i.e., words from the files KNJIŽNICA and ENG.TEXT.
Table 4.16 displays the different results that were obtained from the English and the Slovene test collections.

Quantitative characteristics                    KNJIŽNICA   ENG.TEXT
Total number of terms                           59,088      55,460
Number of distinct terms                        11,525      3,868
Number of terms after removal of stop-words     10,711      3,625
Total of generated endings                      47,981      14,607
Number of distinct endings                      7,273       2,437

Table 4.16: Quantitative comparison of the Slovene and English texts during the process of the automatic generation of suffixes

As can be seen from Table 4.16, the Slovene body of text, owing to the complexity of Slovene morphology, produced a large number of word variants, and consequently a large number of distinct endings, when compared to the English text corpus. This difference makes the frequency algorithm much more suitable for the English language, particularly in the process of deciding upon a threshold for suffix selection. A detailed inspection of both lists of suffixes provides an additional argument against the practical use of the frequency algorithm in Slovene retrieval systems. Table 4.17 illustrates the frequency of occurrence of the automatically generated suffixes from the English and Slovene test collections according to the length of the endings. Although increased suffix length correlated in both test collections with a decreasing percentage of suffix usage, as is evident from Table 4.17, the following important differences between Slovene and English suffixes can be noted. While in the English test collection endings having up to three characters account for 67.9% of the total suffix usage, Slovene suffixes of the same length account for only 63.2% of the total suffix usage. Also, while the Slovene endings having four, five, six, or seven characters account for 33.4% of the total number of suffixes, the English suffixes of this length account for only 29.6% of suffix usage. An important conclusion can be drawn from the above results.
Since the Slovene language is characterized by a wider range of suffixes having a length of four letters or more than is English, it is extremely difficult to define an appropriate threshold for automatic suffix selection.

Number of letters    KNJIŽNICA            ENG.TEXT
in the suffix        abs       %          abs       %
1                    10,683    22.3       3,602     24.7
2                    10,393    21.7       3,410     23.4
3                    9,220     19.2       2,899     19.8
4                    7,017     14.6       2,023     13.8
5                    4,681     9.8        1,249     8.6
6                    2,805     5.8        702       4.8
7                    1,560     3.2        357       2.4
8                    765       1.6        181       1.2
9                    393       0.8        96        0.7
10+                  464       1.0        88        0.6
Total                47,981    100.0      14,607    100.0

Table 4.17: Comparison of the frequency of suffix occurrence in KNJIŽNICA and ENG.TEXT according to the suffix length

Thus, any decision about the potentially useful set of endings to be employed in the frequency algorithm is bound, on the one hand, by a need to introduce long suffixes (i.e., suffixes exceeding the length of three characters), and on the other hand by the constant risk of overstemming. For example, adjacent words in the vocabulary, such as SLOJEVIT and BOJEVIT, produce a highly desirable suffix -EVIT. However, one of its constituent substrings is also -VIT, whose use is fatal for the term RAZVIT. Having obtained unsatisfactory results from the employment of the frequency algorithm on Slovene natural language text, and having identified some reasons for the failure of the algorithm, there was a need to introduce a second stage in the design process. This stage was primarily based on additional manual preprocessing, both in the construction of the final suffix list and in the formulation of the context-sensitive and recoding rules. Although a need for manual involvement was anticipated on the basis of the discussion of the morphological structure of the Slovene language presented in Chapter 3, the low level of performance of the frequency algorithm necessitated a rethinking of the design of the new Slovene stemming algorithm.
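A suffix-length tabulation of the kind shown in Table 4.17 can be reproduced with a short routine; this is a sketch, assuming the generated endings are available as a list of strings:

```python
# A minimal sketch of the suffix-length profiling used in the comparison
# above: endings are bucketed by length (10 letters or more grouped
# together), with absolute counts and percentages of total suffix usage.
from collections import Counter

def suffix_length_profile(suffixes):
    """Return {length: (count, percentage)} for a list of generated endings."""
    buckets = Counter(min(len(s), 10) for s in suffixes)
    total = sum(buckets.values())
    return {n: (c, round(100.0 * c / total, 1)) for n, c in sorted(buckets.items())}

profile = suffix_length_profile(["A", "I", "JE", "NJE", "ANJE", "VANJE", "OVANJE"])
```

Run over the 47,981 KNJIŽNICA endings and the 14,607 ENG.TEXT endings, such a profile yields the two percentage columns of Table 4.17.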
The only benefit gained from the development of the frequency algorithm was a list of the most frequent endings, which served as a starting-point for the design of the final suffix list and for the specification of the context-sensitive and recoding rules.

4.5 Design of the new stemming algorithm for the Slovene language

4.5.1 Development of the suffix list

The second algorithm was developed using the traditional, trial-and-error approach that characterizes most context-dependent algorithms. The process started by taking the 200 most frequent word endings that had been identified in the previous work and determining what extensions and rules needed to be created to allow them to stem correctly the 10,711 content-bearing words from the KNJIŽNICA corpus. The utility of each of these endings as a suffix was tested by seeing whether its removal would result in either understemming or overstemming. Consideration was given to the minimum stem length which should be left after the removal of a given ending, to new endings which needed to be added to the suffix list or endings which needed to be removed from it, and to the context-sensitive and recoding rules needed for accurate conflation. It was often the case that the selection of one suffix would require the adoption or removal of other suffixes, or the addition of context-sensitive rules, in order to maintain consistency; this behaviour is, of course, characteristic of all languages and not specific to Slovene. The following major problems were encountered:

• Selection of a minimum stem length. Most of the shorter words in Slovene preserve their meaning when only three characters remain (consider, for example, BAZ-a, RUS-ija, etc.).
Some terms, however, in particular those characterized by different types of morphemic alternation, require a minimum stem length of four characters (consider, for example, the variants KADER and KADRA; their reduction to the stem KAD- would cause serious overstemming, i.e., KAD- could convey the meaning of both STAFF and TUB).

• It is sometimes necessary to take account of the characters immediately preceding an ending when deciding whether it should be removed from a word. For example, the identification of the very common suffix -BA should normally result in its removal, e.g., TELOVADBA, ODREDBA, POŠKODBA, IZOBRAZBA, etc. However, it should not be deleted when preceded by the character V, to avoid overstemming: otherwise, e.g., the word STAVBA would be stemmed to STAV, which is the beginning of a large number of words, e.g., STAVA, STAVEC, STAVEK, STAVKA, etc.

• The wide range of morphemic alternations that occur during word formation and inflection in the Slovene language requires the use of very extensive recoding rules, much more so than is the case in English. Consider, for example, the following pairs of related words: OBSEG and OBSEŽEN, PREDLOG and PREDLAGATI, NAGRAJEVATI and NAGRADA, PODREJEN and PODREDITI, ODGOVARJATI and ODGOVOR, TEHNIČNI and TEHNIK, IZOBRAZBA and IZOBRAŽEVANJE, DOKAZ and DOKAŽE, ARKTIČNI and ARKTIKA, INTUICIJA and INTUITIVNOST, REGIJA and REGIONALNI, CITIRATI and CITAT, JETRA and JETER, TRG and TRŽEN, GESEL and GESLO, etc. Such examples are not amenable to conflation just by the deletion of the word ending; instead, a complex set of recoding rules is required.

As a result of the manual selection of suffixes, after consideration of the conditions for suffix removal, a list containing 2,086 endings was produced. An inspection of these suffixes showed that some of their inflectional variants were still missing.
Thus, in order to obtain a reliable and comprehensive final suffix list, all possible variants of each suffix were additionally produced, using declension, conjugation, etc. The resulting longest-match, context-sensitive algorithm is based on the use of 5,276 endings, each of which has an associated minimum stem length, either three or four characters, and one of eight action codes, which implement the context-sensitive rules. The final list of suffixes can also be found in machine-readable form on a floppy disk. While the list of stop-words is contained in the file STOP.TXT, the list of endings is included in the file SUFFIX.TXT. Table 4.18 shows an excerpt from the list of endings.

Suffix      Action code   MIN stem length
AVAM        3             4
AVAMA       3             4
AVAMI       3             4
AVANJ       3             3
AVANJA      3             3
AVANJE      3             3
AVANJEM     3             3
AVANJEMA    3             3
AVANJI      3             3
AVANJIH     3             3
AVANJU      3             3
AVATI       3             3
AVCA        2             4
AVCE        2             4
AVCEM       2             4
AVCEMA      2             4
AVCEV       2             4
AVCI        2             4
AVCIH       2             4
AVCU        2             4

Table 4.18: An excerpt from the final list of suffixes

As can be seen from Table 4.18, two digits are appended to each suffix. The first digit is called an action code, which specifies the conditions for suffix removal; the second digit defines the minimum stem length, i.e., it prevents the removal of the suffix if the resulting stem would be less than the specified minimum length. In some cases the minimum stem length is three characters and in other cases four. The employment of this list of suffixes by the new algorithm, and an explanation of some of the recoding rules, are described in the section below.

4.5.2 The new stemming algorithm for the Slovene language

The stemming algorithm consists of two main parts: a basic stemming procedure and a recoding procedure. Both procedures are employed within the algorithm as follows. The basic stemming procedure consists of a single-pass suffix deletion process; that is, one pass is made through a list of suffixes, and if a match is encountered, a deletion is made.
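The single-pass, longest-match suffix deletion with a minimum stem length can be sketched as follows. This is a minimal illustration rather than a transcription of the thesis's implementation: the function name and the four-entry suffix table (an excerpt from Table 4.18, keeping only the minimum stem lengths) are chosen here for exposition, and the action codes are not modelled.

```python
# A minimal sketch of single-pass, longest-match suffix stripping,
# using a tiny illustrative excerpt of the 5,276-entry suffix table
# (suffix -> minimum stem length, taken from Table 4.18); the real
# algorithm also consults an action code attached to each suffix.
SUFFIXES = {"AVANJA": 3, "AVANJ": 3, "AVAM": 4, "AVCA": 4}

def strip_suffix(word):
    # Try longer suffixes first, so that a compound suffix such as
    # -AVANJA is removed before its shorter tail -AVANJ could match.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        min_len = SUFFIXES[suffix]
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[:-len(suffix)]
    return word  # no admissible suffix found: leave the word unchanged
```

For example, the sketch strips -AVANJA from PREDAVANJA (the remaining stem satisfies the minimum length), but leaves GLAVAM unchanged, because removing -AVAM would leave a stem shorter than the required four characters.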
The comparison proceeds in a longest-first sequence, to avoid incomplete truncation of compound suffixes. For example, as -SKEGA and -EGA are both in the suffix list, the comparison should not be made with -EGA first, since -SKEGA would then never be detected. Suffixes that are to be deleted appear in the suffix list, containing 5,276 endings. Appended to each suffix are an action code and a minimum stem condition. The list of suffixes in Table 4.19 illustrates all potential occurrences of both codes. These examples can serve as a basis for the explanation of the basic stemming procedure in the algorithm.

The first context-sensitive rule in the basic stemming procedure relates to the minimum stem length which should be left after the removal of the suffix. The minimum stem length can consist of either three or four characters. Although the majority of suffixes have the appended condition that the remaining stem should be of a minimum length of three characters, the morphological characteristics of some words required the minimum stem length to be extended to four characters, and that condition is added to a certain number of suffixes. For example, deletion of the suffix -ETI from the word ŽIVETI results in the stem ŽIV-; to prevent potential overstemming if the suffix -AL were stripped from the word ŽIVAL, a code for a minimum stem length of four characters was attached to the suffix -AL, to produce the unchanged stem ŽIVAL.

Suffix    Action code   MIN stem length
AL        3             4
ALNA      2             3
ATA       3             4
BA        5             3
EK        2             4
EM        8             4
ENE       6             3
ETI       3             3
IZACIJA   4             3
KACIJA    1             3
NA        2             4
OV        7             3

Table 4.19: Examples of suffixes and their digit codes

When a condition for the minimum stem length is satisfied in the algorithm, then one of eight courses of action is followed, as determined by the action code associated with each suffix. These codes are as follows, with examples of their application in brackets:

1.
Delete the terminal characters of the word being processed which match the suffix-list entry (suffix -KACIJA: KLASIFIKACIJA to KLASIFI);

2. Delete as above if and only if the character preceding the matching entry in the word is a consonant (suffix -ALNA: NACIONALNA to NACION but SOCIALNA to SOCIAL);

3. Delete as in code 2, but not if there are two successive consonants preceding the matching entry (suffix -ATA: KANDIDATA to KANDID but KOLOVRATA to KOLOVRAT);

4. Delete as in code 2, but not if the consonant preceding the matching entry is -R (suffixes -NA, -ACIJA: POLARNA to POLAR but POLARIZACIJA to POLARIZ);

5. Delete as in code 2, but not if the consonant preceding the matching entry is -V (suffixes -EK, -BA: STAVEK to STAV but STAVBA to STAVB);

6. Delete as in code 2, but not if the consonant preceding the matching entry is -M and is the third letter in the word (suffix -ENE: VODENE to VOD but ZAMENE to ZAMEN);

7. Delete as in code 2, but not if the letters preceding the matching entry are either -SL, -BN, or -SN (suffix -OV: STANDARDOV to STANDARD but PRISLOV, OBNOV and OSNOV remain unchanged);

8. Delete as in code 2, but not if the letters preceding the matching entry are either -BL or -ST (suffix -EM: HITREM to HITR but PROBLEM and SISTEM remain unchanged).

Once an ending has been removed, the recoding procedure takes place. This consists of the following three steps, which are carried out in sequence:

1. A stem dictionary of six special cases is checked to see whether any of the transformations shown in Table 4.20 should be applied.

Stem                      Recoded Stem
TIS                       TISK
TRŽ                       TRG
RAZVIJ, RAZVIL, RAZVIT    RAZVOJ
VZGAJ                     VZGOJ
KEMIČ, KEMIK              KEMIJ
LOGIK                     LOGIČ

Table 4.20: Stem dictionary of special cases.

2. A total of 20 recoding rules is applied, as shown in Table 4.21.

3. The E ~ zero alternation, which has been mentioned in Section 2, is attended to, as shown in Table 4.22.
This table considers only the E ~ zero alternation and takes no account of the A ~ zero, I ~ zero and O ~ zero alternations that also occur; however, these are encountered much less frequently, and some of the more commonly occurring alternations here are encompassed by the recoding rules of Table 4.21.

Ending         Recoded Ending   Example Stem         Recoded Stem
-SEŽ, -SEČ     -SEG             PRESEŽ, PRESEČ       PRESEG
-LAG, -LOG     -LOŽ             PREDLAG, PREDLOG     PREDLOŽ
-GRAJ          -GRAD            NAGRAJ               NAGRAD
-REJ           -RED             PRIREJ               PRIRED
-GOVAR         -GOVOR           ZAGOVAR              ZAGOVOR
-NAŠ, -NOS     -NES             VNAŠ, VNOS           VNES
-NIŠ, -NIČ     -NIK             TEHNIŠ, TEHNIČ       TEHNIK
-IŠ            -IS              IZKORIŠ              IZKORIS
-BRAŽ          -BRAZ            IZOBRAŽ              IZOBRAZ
-KAŽ           -KAZ             DOKAŽ                DOKAZ
-TIČ           -TIK             OPTIČ                OPTIK
-UIT           -UIC             INTUIT               INTUIC
-ION           -IJ              REGION               REGIJ
-ČAN           -ČIN             OBČAN                OBČIN
-NAC           -NIR             SANAC                SANIR
-UŠ            -US              POSKUŠ               POSKUS
-VIR           -VOR             IZVIR                IZVOR
-STAJ, -STAL   -STAN            OBSTAJ, OBSTAL       OBSTAN
-STAT, -STOJ   -STAN            OBSTAT, OBSTOJ       OBSTAN
-SAB           -SOB             USPOSAB              USPOSOB
-TIR           -TAT             CITIR                CITAT

Table 4.21: Recoding rules for 20 word endings after removal of the initial suffix.

Ending            Recoded Ending     Example Stem   Recoded Stem
-consonant + -R   -consonant + -ER   KADR           KADER
-consonant + -N   -consonant + -EN   JAVN           JAVEN
-consonant + -L   -consonant + -EL   GESL           GESEL
-consonant + -M   -consonant + -EM   POJM           POJEM

Table 4.22: Recoding rules for use with the four sonorants.

Having described the main components of the algorithm, the actual implementation is as follows:

1. The suffix list is stored as reversed suffixes, in alphabetical order; stems and suffixes from the recoding list are also reversed.

2. The maximum suffix length is noted (call this MAX).

3. The word to be conflated is read and its characters counted; if the length of the word is less than three characters, it is output without any change, otherwise it is reversed.

4. The last MAX characters of the word (the first MAX when reversed) are taken as a potential suffix.

5. The current potential suffix is searched for in the list of suffixes.

6.
This search can have one of three possible outcomes:

• The suffix is found. In this case, the minimum stem length condition and the action code are examined. If these are satisfied, then the ending is removed from the word. If they are not satisfied, then two courses of action are possible, depending on the length of the remaining ending. If this is greater than one character, the first character (the last when reversed) is eliminated and a new ending created; the algorithm is re-entered at Stage 5. Alternatively, if the ending is now one character long, then the word to be stemmed remains unchanged and is sent to the recoding part of the algorithm.

• The suffix is not found and it is more than one character long. In this case, the first character (the last when reversed) is eliminated and a new suffix created; the algorithm is re-entered at Stage 5.

• The suffix is not found and it is one character long. In this case, the word to be stemmed remains unchanged.

7. The three main parts of the recoding procedure are carried out in sequence on the stem, as detailed in Tables 4.20, 4.21 and 4.22.

8. The stem resulting from the above transformations is re-reversed and output.

Evaluation of the stemming algorithm

To obtain a preliminary indication of how successful the design of the algorithm was, and whether any major changes in the algorithm were necessary before its incorporation into INSTRUCT, a simple test was first carried out. The percentage of compression and the quality of the algorithm were measured by applying the algorithm to 83 abstracts from the collection of documents that is fully described in Chapter 6. The aim was to obtain more reliable results by using a larger number of word types than were contained in the 10 previously used abstracts from the KNJIŽNICA corpus. The results of this preliminary evaluation are discussed in the section below.

The level of compression achieved by the algorithm.
After stop-words had been deleted from this collection, a total of 2,616 distinct word types was obtained. The employment of the stemming algorithm reduced these words to 1,184 distinct stems, i.e., 45.3% of the original total. If the level of compression is expressed in terms of the number of words removed, then 54.7% compression was achieved. This figure is much higher than the 12% reported by Dimec (1988). It is also higher than the results of the various English stemmers studied by Lennon et al. (1981), where the level of compression ranged from 26.2% to 50.5%. There is no doubt that the Slovene stemming algorithm contributes significantly to the reduction of the body of text. These results also indicate that the algorithm is a "strong" stemmer. Since "strong" stemming can hurt performance (Harman, 1991), it was interesting to observe how successfully the stemming algorithm performed in terms of its ability to reduce word variants to a common stem; this aspect of the Slovene stemmer's performance is discussed below.

Results of word conflation achieved by the Slovene stemmer. The list of 2,616 distinct word types and the resulting 1,184 distinct stems was given to a trained intermediary for additional assessment. In other words, the professional intermediary was asked to check manually the results of the automatic word conflation. His main task was to discover the two main types of error, i.e., under- and overstemming. A total of 109 errors was reported by the professional intermediary; the success rate of suffix stripping was therefore 90.8%. These results are in accordance with Porter (1980), who emphasized that the success rate is always significantly less than 100%, irrespective of how the process is evaluated. In order to test this statement, the English-language equivalents of the above 83 abstracts were processed by Porter's algorithm.
It is interesting to note that, after English stop-words had been deleted from this document collection, a total of only 1,250 distinct word types was obtained. This again indicates the richness of Slovene morphology; the difference is further discussed in Chapters 6, 7, and 8. In addition, the employment of the stemming algorithm reduced the 1,250 word types to 1,065 distinct stems, i.e., 85.2% of the original total. In other words, a level of compression of only 14.8% was achieved by Porter's algorithm. However, the results of the experiments described in the next three chapters confirmed the assumption that the larger the dictionary size, the greater the compression achieved: after Porter's algorithm had been applied to the 4,756 word types within the INSTRUCT database, a 36.7% level of compression was produced. More importantly, the additional manual assessment of the resulting stems reported a total of 93 errors, an 8.7% error rate. A summary of the results produced by the application of the Slovene (referred to as SLOV) and Porter's (referred to as ENGL) stemming algorithms is given in Table 4.23.

Text   Distinct terms   Distinct stems   Compression (%)   Error rate (%)
SLOV   2,616            1,184            54.7              9.2
ENGL   1,250            1,065            14.8              8.7

Table 4.23: Comparison of the Slovene and Porter's algorithms for suffix stripping.

It is evident from Table 4.23 that both stemming algorithms were able, despite the huge difference in terms of text compression, to produce a similar success rate. While the Slovene stemming algorithm achieved a 90.8% success rate, its English counterpart's success rate was only slightly better, i.e., 91.3%. These findings correspond to the results of experiments carried out by Lennon et al. (1981). This research group demonstrated that there was no relationship between the strength of an algorithm and the consequent retrieval effectiveness (see also Keen, 1991b).
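The figures in Table 4.23 can be reproduced directly from the raw counts quoted in the text; the two helper functions below are illustrative, not part of the original evaluation software.

```python
# The figures in Table 4.23, recomputed from the raw counts quoted in
# the text (distinct word types, distinct stems, and manually reported
# conflation errors).
def compression(word_types, stems):
    # percentage of distinct word types eliminated by conflation
    return round(100.0 * (word_types - stems) / word_types, 1)

def error_rate(errors, stems):
    # percentage of resulting stems judged under- or overstemmed
    return round(100.0 * errors / stems, 1)

assert compression(2616, 1184) == 54.7   # SLOV
assert compression(1250, 1065) == 14.8   # ENGL
assert error_rate(109, 1184) == 9.2      # SLOV: 109 reported errors
assert error_rate(93, 1065) == 8.7       # ENGL: 93 reported errors
```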
The results of the above experiment revealed that both stemming algorithms were vulnerable to a certain number of errors produced during word conflation. The following are some examples of unsuccessful word conflation by the Slovene stemming algorithm (errors produced by the application of Porter's algorithm will be presented in Chapter 8):

• overstemming: BESED - besed, besedila; CENT - center, centralizirana; GLAV - glavo, glavni; GRAD - gradivo, gradnja, grajski; LIK - lik, likovni; NEM - nem, nemški; ODGOVOR - odgovoriti, odgovorna; OSEB - osebne, osebnosti; PODOB - podoba, podobni; POT - poteza, poti; PRAV - pravila, pravica; PROS - prosti, prostor; ROM - romih, romane; UMET - umetne, umetnost;

• understemming: avtomat - avtomatiz; dopol - dopoln; infor - inform; integ - integrir; instit - institut; jas - jasen; natan - natanč; poudar - poud; prim - primer; prog - program; razvitem - razvoj; tisk - tiskov.

Despite the 9.2% error rate, the results of this simple test indicated that the procedures used in the stemming algorithm are workable and will yield good results with only minor changes. Although these alterations might involve the list of endings and, occasionally, the context-sensitive and recoding rules, the basic principles of the new Slovene stemming algorithm remain the same. Good performance results are demonstrated below.
For example, the abstract listed previously resulted in the following stemmed text representative:

UPORAB ONLIN JAVEN DOSTOP KATAL JOŽE KOKOLE PREDSTAV FENOM ONLIN JAVEN DOSTOP KATAL KRAT OPAC ANGL ONLIN PUBLIC ACCES CATALOGUE RAČUNAL PODPR POS KNJIŽ KNJIŽ SISTEM NASTAN RAZVOJ STAN RAZVOJ SREDIN NAČEL KARAKTER POJAV OBLIK OBDEL UPORAB OPAC GENER KNJIŽ VZAJEM KATAL ODNES ONLIN BIBLIOGRAF SERVIS PROBLEM KON UPORAB UPORAB OPAC ZAHTEV POGOJ OBLIK UČIN UPORAB PRIJAZ ISKAL DIAL PERSPEK UVAJ OPAC KATAL

Apart from producing useful stems, as shown above, the algorithm has also nicely captured the characteristics of Slovene morphology. In other words, the algorithm has demonstrated that it can conflate most of the "difficult" word variants to the same stem. Table 4.24 illustrates the performance of the algorithm, using some examples. The listing consists of four basic stems, having in total 29 different variants as they appeared in the document collection. All words are organized in three columns: the first column shows the original word, the second the initial stem produced by the basic stemming procedure, and the third the recoded stem, i.e., the final output. Each of the examples in Table 4.24 illustrates the employment of both the basic stemming procedure and the recoding rules. For example, the stems CITAT- and REGIJ- illustrate alterations occurring at the end of the stems after the use of recoding suffixes; the stem GESEL- describes the insertion of -E- between the consonant and -L at the end of the stem; the stem KEMIJ- represents the type of correction which is carried out when the list of stems to be changed is employed.
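The three-step recoding procedure just illustrated can be sketched as follows. This is a minimal sketch under stated assumptions: the tables are reduced to small excerpts of Tables 4.20, 4.21 and 4.22, and the function name and data structures are chosen here for exposition rather than taken from the thesis's implementation.

```python
# A minimal sketch of the three-step recoding procedure applied after
# suffix deletion, using small excerpts of Tables 4.20, 4.21 and 4.22;
# the full algorithm uses the complete tables.
SPECIAL_STEMS = {"KEMIK": "KEMIJ", "KEMIČ": "KEMIJ", "TRŽ": "TRG"}  # Table 4.20
ENDING_RULES = {"TIR": "TAT", "ION": "IJ", "GRAJ": "GRAD"}           # Table 4.21
SONORANTS = "RNLM"                                                   # Table 4.22
VOWELS = set("AEIOU")

def recode(stem):
    # Step 1: dictionary of special cases.
    if stem in SPECIAL_STEMS:
        return SPECIAL_STEMS[stem]
    # Step 2: recoding rules applied to word endings.
    for ending, recoded in ENDING_RULES.items():
        if stem.endswith(ending):
            return stem[:-len(ending)] + recoded
    # Step 3: E ~ zero alternation - insert E between a consonant
    # and a final sonorant (R, N, L, M).
    if len(stem) >= 2 and stem[-1] in SONORANTS and stem[-2] not in VOWELS:
        return stem[:-1] + "E" + stem[-1]
    return stem
```

On the examples of Table 4.24, this sketch recodes CITIR- to CITAT-, GESL- to GESEL-, KEMIK- to KEMIJ-, and (as in Table 4.22) KADR- to KADER-.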
Term         Initial stem   Final stem
CITATIH      CITAT          CITAT
CITATOV      CITAT          CITAT
CITIRAMO     CITIR          CITAT
CITIRANA     CITIR          CITAT
CITIRANEGA   CITIR          CITAT
CITIRANEM    CITIR          CITAT
CITIRANI     CITIR          CITAT
CITIRANIH    CITIR          CITAT
REGIJA       REGIJ          REGIJ
REGIJI       REGIJ          REGIJ
REGIONALNE   REGION         REGIJ
REGIONALNE   REGION         REGIJ
REGIONALNI   REGION         REGIJ
REGIONALNIH  REGION         REGIJ
REGIONALNO   REGION         REGIJ
GESEL        GESEL          GESEL
GESELSKI     GESEL          GESEL
GESLA        GESL           GESEL
GESLI        GESL           GESEL
GESLO        GESL           GESEL
GESLOM       GESL           GESEL
GESLU        GESL           GESEL
KEMIJA       KEMIJ          KEMIJ
KEMIJE       KEMIJ          KEMIJ
KEMIJSKA     KEMIJ          KEMIJ
KEMIJSKE     KEMIJ          KEMIJ
KEMIJSKO     KEMIJ          KEMIJ
KEMIKI       KEMIK          KEMIJ
KEMIČNIMI    KEMIČ          KEMIJ

Table 4.24: Recoding strategies and suffix deletion applied to some examples

Finally, the effectiveness is also demonstrated by reference to the eight sets of word variants that were discussed previously in the evaluation of the earlier, context-free algorithm. When the context-sensitive algorithm described in this section was used, all of the 220 variants were correctly conflated to the stems shown in Table 4.15. However, to obtain the final results on its retrieval performance, the first objective was to incorporate the Slovene stemming algorithm into INSTRUCT. The effectiveness of the routines developed here will be evaluated by running searches on a Slovene database; these results will then be compared with the results obtained by applying manual right-hand truncation and non-conflation to the words in queries.

Chapter 5

INSTRUCT: an INteractive System for Teaching Retrieval Using Computational Techniques

5.1 Introduction

Schools of librarianship and information studies have been an important source for the development of alternative searching techniques. As part of information retrieval courses, many teaching aids have been designed in the past to simulate searching on databases (Wood, 1984). However, all these simulation models have been developed to familiarize students with conventional, Boolean-based retrieval systems.
Thus, the lack of a teaching aid which would help students to become familiar with advanced retrieval techniques had become more than evident. It is precisely this fact which led to the design of INSTRUCT (INteractive System for Teaching Retrieval Using Computational Techniques) at the Department of Information Studies, University of Sheffield. INSTRUCT is an interactive system which has been developed mainly to enable students of librarianship and information studies to become familiar with the new generation of computerized, statistically-based retrieval systems (Willett and Wood, 1989). INSTRUCT is now being used for this purpose in educational organizations both in the UK and abroad. In addition to its use as a teaching resource, INSTRUCT has also proven to be a very useful test bed for investigating a range of research problems encountered in information retrieval.

This chapter will be concerned mainly with a description of INSTRUCT. Firstly, a brief outline of the main components of the original version of INSTRUCT (Hendry et al., 1986a,b) will be given, followed by an explanation of the modules which were later added to it (Wade and Willett, 1988). This will form a basis for the description of INSTRUCT both as a teaching resource and as a test bed for research problems. The chapter will be concluded by a brief summary of the main modifications to the original version of INSTRUCT which were needed to develop the new, Slovene version.

5.2 Original version of INSTRUCT - program facilities

The original version of INSTRUCT was implemented in the Summer of 1985 on a PRIME 750 minicomputer under the PRIMOS operating system. This version incorporates the following facilities: natural language query processing (including the elimination of stop-words, automatic word conflation and automatic identification of word variants), best-match searching, relevance feedback searching, and Boolean searching.
This version of INSTRUCT runs against a search file that comprises 6,004 documents from the 1982 additions to the Library and Information Science Abstracts (LISA) database. Each of the records in this search file contains the accession number, the title and the abstract of a document; a shortage of disk space prevented the inclusion of the full citation data. The retrieval of these records is based on the occurrence of terms in the titles and abstracts of documents. INSTRUCT uses the conventional inverted file structure.

5.2.1 The user interface

In the original, i.e., PRIME, version of INSTRUCT, the following two types of user interface are available:

• the "novice" interface;

• the "experienced user" interface.

The "novice" interface is completely menu-driven and provides the user with a great deal of explanatory information at all stages during the search. The "experienced user" interface is also menu-driven, but presents the user with only a limited amount of explanatory text.

5.2.2 Query formulation

The query module of INSTRUCT allows the user to input a natural language need statement as the basis for the query to be searched. Thus, no Boolean operators need to be specified (although this can be done later if a Boolean search, rather than a best-match search, is required). The key terms in the query list are identified after non-content-bearing words (e.g., AND, BUT, ARE) have been eliminated. Those terms not found in the stop-word list are then stemmed using the algorithm suggested by Porter (1980). The resulting list of stems is then shown with the corresponding frequencies of occurrence in the 1982 LISA database. The user can change the query by selecting one of three options: the addition, deletion, or expansion of query terms. The addition or deletion of terms in the query list is achieved very simply by updating the data structure in which the current form of the query list is stored.
The third option, i.e., the expansion of a query term, is based on a measure of the similarity between strings of text. The assumption behind this type of term expansion is that the character structure of a word is related to its meaning, so that it can be used as a basis for classifying that word. The measure of similarity is based on the number of trigrams (i.e., three-character substrings) common to the selected keyword and each of the stems in the dictionary component of the inverted file. The similarity is calculated using the highly efficient best-match searching algorithm suggested by Noreault et al. (1977). The stems in the dictionary file are then sorted into order of decreasing similarity with the selected query stem, and the ten most similar stems are displayed to the user for possible inclusion in the query list.

5.2.3 Searching

Once a set of keyword stems has been obtained that adequately describes the query, the user is in a position to carry out a search of the database. There are two main types of search strategy implemented in the original version of INSTRUCT: the best-match search and the Boolean search.

Best-match searching

A nearest neighbour, or best-match, search procedure forms the basic searching mechanism available in INSTRUCT. The following is a summary of the main actions which take place in best-match searching: the presence or absence of each of the query terms in each of the documents in the database is noted (documents having no terms in common with the query are eliminated), the sum of the weights for these terms is calculated, and these sums of weights are then sorted so as to obtain a ranking of the documents. The best-match searching procedure in INSTRUCT is based on the algorithm suggested by Noreault et al. (1977), and discussed further by Perry and Willett (1983).
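The trigram-overlap measure used for query-term expansion in Section 5.2.2 can be sketched as follows. This is a naive illustration: the real system uses the efficient dictionary-file search algorithm of Noreault et al. (1977), and the function names and the cut-off parameter here are chosen for exposition.

```python
# A minimal sketch of trigram-based string similarity for query-term
# expansion: candidate stems are ranked by the number of three-character
# substrings they share with the selected keyword, and the most similar
# ones are offered for inclusion in the query.
def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def expand(keyword, dictionary, n=10):
    # Sort dictionary stems by decreasing trigram overlap with the
    # keyword and return the n most similar stems.
    overlap = lambda stem: len(trigrams(keyword) & trigrams(stem))
    return sorted(dictionary, key=overlap, reverse=True)[:n]
```

For example, against a toy dictionary, the keyword RETRIEVAL is most similar to the stem RETRIEV (five shared trigrams), with unrelated stems such as LIBRAR ranked last.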
The weights used in INSTRUCT to reflect the discriminating power of the individual query terms are the modified inverse document frequency (IDF) weights first suggested by Croft and Harper (1979). After performing an initial best-match search, the INSTRUCT user has an opportunity to perform further searches based on modified term weights, in an attempt to improve retrieval performance. This searching mechanism is described in the section below.

Relevance feedback searching

As each document is displayed on the screen in ranked order, the user is asked to state whether or not the retrieved document is relevant to the query. The relevance information provided by the user then forms the basis for the modification of the term weights, in order to obtain a better ranking of the remaining, uninspected documents in the collection. The relevance feedback technique used in INSTRUCT is based on the approach first suggested by Robertson and Sparck Jones (1976), which takes account of the presence or absence of query terms in both relevant and non-relevant documents. Possible alternatives to the relevance feedback search are to go further down the initial ranked list, to modify the query by the addition, deletion, or expansion of terms, or to carry out a Boolean search as described in the next section.

Boolean searching

The conventional Boolean search based on the AND, OR, and NOT operators (but without proximity searching) has been included in INSTRUCT mainly to allow students to compare the Boolean retrieval mechanism with the best-match searching facilities. However, INSTRUCT also allows the user to receive a ranked Boolean output, a facility that is available in some free-text retrieval packages (Kimberley, 1987). This is done by the user specifying Boolean constraints that must be satisfied by the output from the best-match search. The aim of this hybrid search is to exclude from the ranking those documents which do not fulfil the Boolean constraints.
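The best-match search described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a plain IDF weight rather than the modified Croft and Harper (1979) form, scans documents directly rather than using INSTRUCT's inverted file, and all names and the toy document representation (a set of stems per document) are chosen here for exposition.

```python
# A minimal sketch of IDF-weighted best-match searching: each query stem
# gets an inverse document frequency weight, each document scores the sum
# of the weights of the query stems it contains, and documents are ranked
# by decreasing score.
import math

def best_match(query_stems, documents):
    n = len(documents)
    # simple IDF weight: log(N / document frequency) for each query stem
    weight = {s: math.log(n / sum(s in d for d in documents))
              for s in query_stems if any(s in d for d in documents)}
    scores = []
    for doc_id, doc in enumerate(documents):
        score = sum(w for s, w in weight.items() if s in doc)
        if score > 0:            # documents sharing no stems are dropped
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A rarer query stem thus contributes a larger weight, so a document matching it ranks above one matching only a common stem.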
5.3 Enhancements to INSTRUCT

This section will briefly describe the new version of INSTRUCT, which includes three main additional modules: query expansion based on co-occurrence data, cluster-based searching, and browsing. The implementation of these new components was in line with the objective of providing students with live "hands-on" experience of searching using advanced IR methods that are not otherwise generally available, i.e., that are described only in research articles and monographs.

This version of INSTRUCT has also been written in the standard PASCAL programming language; it runs on an IBM 3083 mainframe under the VM/CMS operating system. The search file contains the 26,280 records comprising the 1982-1985 input to the Library and Information Science Abstracts (LISA) database. As with the original version, the IBM version of INSTRUCT is based on a conventional inverted file structure.

5.3.1 The user interface

In addition to the "novice" and "experienced user" interfaces which are available in the original version of INSTRUCT, an "expert", command-driven interface has been incorporated into the IBM version of INSTRUCT. This interface, in which help information is available only as a specific option, has been introduced to reflect the needs of students and members of staff who are already familiar with the techniques demonstrated in INSTRUCT. These users see the system as a rapid way of getting references, rather than as a way of learning about advanced retrieval research.

5.3.2 Query expansion on the basis of term co-occurrence

Besides the expansion module based on the string similarity measure (i.e., the number of trigrams in common), the IBM version of INSTRUCT contains an expansion routine which makes use of term clustering techniques. Here, the identification of the stems most similar to a selected query term is based on the extent to which a stem co-occurs with the query term throughout the database.
The algorithm which has been used for the calculation of the inter-keyword similarities is one proposed by Willett (1981); this is derived from an earlier algorithm for the calculation of query-document similarities which had been proposed by Noreault et al. (1977). The algorithm uses the inverted file to identify all of the documents in which a given term occurs. Since the database in INSTRUCT is static, the 20 nearest neighbours for each of the keyword stems were calculated in a single batch run and then stored in a file, to achieve a quick response. Thus, after the 20 terms which are judged to be most similar to the chosen query stem are displayed for inspection, the user can include any of these stems in the query list. The algorithm used here is efficient in operation, since it removes the need to calculate similarities between pairs of stems which do not co-occur in any of the documents in the database. This algorithm is probably the most efficient currently available, but it would be very demanding of computational resources if used for the interactive identification of related stems, particularly in the case of very frequently occurring stems, where many document vectors need to be added together and where a disc access would probably be required for each and every occurrence of the keyword in the database (Willett, 1981). A more efficient technique, derived from the algorithm used here, has been suggested by Noreault and Chatham (1982), and might be sufficiently fast for interactive use. However, that algorithm can only be used with low-frequency keywords; moreover, the similarities are not completely accurate, since they are calculated by means of a sampling procedure (Wade and Willett, 1988).

5.3.3 Cluster-based searching

A third search option available in the IBM version of INSTRUCT (besides best-match and Boolean searching) involves the clustering of the documents in the database.
This procedure uses the concept of a nearest neighbour cluster, or NNC, as discussed by Griffiths et al. (1984), which has been shown to be of general applicability.

5.3.4 Post-search options

The post-search mechanism in the IBM version of INSTRUCT contains the following modules:

• constraining the results of a best-match search or of a cluster search using Boolean logic (hybrid search);

• browsing;

• performing feedback searches.

Hybrid searching

In the IBM version of INSTRUCT, the hybrid search is carried out by eliminating from the initial ranking (achieved by a best-match search or a cluster search) any documents which fail to satisfy the Boolean constraints. These can be imposed and released any number of times after the initial ranking has been produced, so that the user can experiment with a number of different sets of constraints.

Browsing

The importance of browsing in information retrieval systems has been widely recognized (Hildreth, 1982; Bawden, 1986). Browsing should allow the user to follow up a particularly interesting document without losing the main thread of the query. In particular, browsing exploits the tendency, inherent in retrieval systems, for the system to identify "fringe" material, which may be related to the user's query in unexpected ways (Wade and Willett, 1988). In the IBM version of INSTRUCT, the user can invoke a browsing option after identifying one or more relevant documents in the initial search. INSTRUCT then allows any of these documents to be used as the basis for either a chain search or a seed search. Chain searching involves following the chains of related documents which thread their way through the file of nearest neighbours; this file was set up for the NNC searching routine.
When a document is selected by the user as the basis for the browsing option, the nearest neighbour to this document is displayed, then that document's nearest neighbour, and so on until the chain doubles back on itself, i.e., until a pair of reciprocal nearest neighbours is encountered (Murtagh, 1983).

Alternatively, seed searching results in a best-match search being executed in which the original query stems are replaced by the stems in the title and abstract of a document which has been chosen by the user, i.e., a relevant document identified in the initial search. This second search results in a ranking of documents in order of decreasing similarity with the seed; documents which have already been seen are then eliminated from the ranking. The remaining documents can be viewed by the user, and the new documents can, in their turn, be selected for another seed search. To ensure a rapid response to a request, INSTRUCT selects only the 25 stems of the seed document with the lowest postings (and hence the highest specificity).

Feedback searching

The relevance information obtained from the initial search and, possibly, from a browse search is used by INSTRUCT to modify the original query in two ways: by modifying the weights associated with the individual query stems (as described in previous sections), and by allowing the user to add to the query any stems which have tended to occur in documents they considered to be relevant. In this second option, known also as relevance-based query expansion (Wade and Willett, 1988), rather than calculating the relevance weights just for the original query stems, the weights are calculated for all of the stems which occur in any of the documents which have been identified as being relevant to the query. These stems are then sorted into order of decreasing weight, and the twenty top-ranked stems are displayed for the user to add to the original query as required.
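The reweighting step used in feedback searching can be sketched with the familiar relevance weight in the style of Robertson and Sparck Jones (1976). This is an illustrative formula with the usual 0.5 smoothing terms, not a transcription of INSTRUCT's code; the function and parameter names are chosen here for exposition.

```python
# A minimal sketch of relevance weighting in the style of Robertson and
# Sparck Jones (1976): a term's weight reflects how much more often it
# occurs in documents judged relevant than in the rest of the collection.
# The 0.5 additions are the usual smoothing to avoid zero counts.
import math

def rsj_weight(r, R, n, N):
    # r: relevant documents containing the term, R: relevant documents,
    # n: documents containing the term, N: documents in the collection
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))
```

A term occurring in all of the documents judged relevant receives a large positive weight, while a term occurring only in non-relevant documents receives a negative one, which pushes documents containing it down the revised ranking.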
5.4 Main modules of the INSTRUCT package

It can be seen from the above sections that both the original PRIME version and the IBM version of INSTRUCT allow the demonstration of many information retrieval techniques which have been studied intensively in information retrieval research over the last two decades. Since the IBM version of INSTRUCT includes all modules developed for the original version (the only exception is a guided search), the main components of INSTRUCT can be illustrated with the following list (words in italics indicate facilities which were added to the original version):

• User interface:
  - novice
  - experienced user
  - guided search (not included in the IBM version)
  - expert
• Query formulation:
  - natural language input
  - exclusion of stop-words
  - automatic word conflation
  - assignment of initial weights to stems
• Query reformulation:
  - addition of query terms
  - deletion of query stems
  - expansion of query stems:
    * string similarity (trigrams)
    * co-occurrence data
• Searching:
  - best-match search
  - Boolean search
  - cluster-based search
• Post-search options:
  - hybrid search (imposing Boolean constraints)
  - browsing:
    * chain search
    * seed search
  - feedback search:
    * reweighting query stems
    * weighting stems in relevant documents (relevance-based query expansion)

As has already been noted, both versions of INSTRUCT have been widely used as a teaching aid and as a useful test bed. The next sections will briefly describe how these functions of INSTRUCT have been used at the Department of Information Studies, University of Sheffield.

5.5 The use of INSTRUCT at the University of Sheffield

5.5.1 Use in teaching programmes

The main use of INSTRUCT in teaching programmes is in relation to the information storage and retrieval courses. Over the years, the Department has developed several different simulations for on-line searching (Wood, 1981).
Simulations have been designed to familiarize students with information retrieval techniques that they are likely to encounter when they enter employment. Their great value is in providing stress-free, hands-on practice for large numbers of students before they use the real on-line services. However, all the simulations (the latest is DIAL-SOS, which simulates DIALOG searching on IBM-compatible microcomputers) have been designed to demonstrate conventional, Boolean retrieval methods. The encouraging results reported from experimental information retrieval research led to the idea of developing INSTRUCT, primarily as a teaching aid, in order to achieve the following aims (Willett and Wood, 1989):

• to allow students to use some of the more advanced retrieval techniques which were not widely used by the on-line services;
• to allow students to specify real queries that could be searched against a document database of non-trivial size.

Thus, the IBM version of INSTRUCT is introduced to the students in the first term with two main purposes:

• to become familiar with the routine usage of a computer-based information retrieval system. The main advantage of INSTRUCT lies not only in the possibility of searching documents in a subject area which is familiar to students, but also in allowing the students to search a large database with real queries at no cost;
• to become familiar with the differences between Boolean and best-match searching. Although INSTRUCT was primarily designed to demonstrate the latter facilities, the Boolean module (based on AND, OR, and NOT operators) is quite sufficient to serve as a teaching aid to students. The resulting skills can then be used in real database searching (e.g., usage of the CD-ROM version of the LISA database).
In addition, INSTRUCT forms a primary teaching resource for illustrating advanced information retrieval techniques (e.g., index term weighting schemes, automatic relevance feedback, word stemming) in elective courses in the second term.

5.5.2 Use in research programmes

Since its initial implementation, INSTRUCT has also been shown to be a useful test bed for investigating research problems in information retrieval. The following is a list of some experiments which have been carried out recently:

• Comparisons of the effectiveness of Boolean and best-match searching; these comparisons have been done both qualitatively (Nelis, 1985) and quantitatively (Mohan, 1987).
• A comparative evaluation of the effectiveness of searching using a knowledge-based information retrieval system (i.e., PLEXUS, see Vickery et al., 1987; at a later stage, its commercial implementation, Tome Searcher, was also used) and a statistically-based system (i.e., INSTRUCT); the results of this experiment are described by Wade et al., 1988.
• Best-match searching using text signatures (Wade et al., 1989).
• Full-text searching, providing the ranking of paragraphs within a document (Al-Hawamdeh and Willett, 1989).

5.6 Processing of documents and queries in a Slovene language free-text retrieval system

The processing routines in INSTRUCT are, in very large part, independent of the actual language in which the texts have been written. The only exception is the stemming algorithm, which has to take account of the morphological structure of the particular language. The original version of INSTRUCT is thus based on the stemming algorithm developed for the English language by Porter (1980).
This feature of INSTRUCT—together with the fact that the package has been written in standard PASCAL—indicates that it should be relatively easy to convert INSTRUCT to allow best-match searching of text in any particular language that uses suffixing to create word variants in a manner analogous to that of English. Thus, to implement a Slovene language-based free-text retrieval system as the main goal of this PhD thesis, two important stages of research work had to be carried out:

• design of a stop-word list and development of a stemming algorithm for the Slovene language;
• modification of the English-based version of INSTRUCT in order to incorporate the possibility of processing Slovene documents and queries.

Since the first stage of research work (i.e., the stemming algorithm and stop-word list) is described in detail in Chapter 4 (see also Popovič and Willett, 1990), the next section will briefly outline the main modifications of the original version of INSTRUCT.

5.6.1 The Slovene version of INSTRUCT

The main idea behind the development of a stop-word list and a stemming algorithm for the Slovene language was to provide end-user access to bibliographic databases (written in Slovene) using best-match searching techniques. It was, therefore, assumed that the retrieval modules implemented in the original (PRIME) version of INSTRUCT should be sufficient to incorporate the Slovene stemming algorithm, and also to demonstrate its performance effectiveness. In addition, the decision to implement the Slovene version of INSTRUCT on IBM PC-compatible hardware supported the idea of using the original, PRIME version of INSTRUCT, and not the later IBM version. The conversion of all modules available in the IBM mainframe version would result in serious PC memory allocation problems and thus lead to time-consuming re-writing of the whole program.
The Slovene version of INSTRUCT was thus implemented on the basis of the conversion of the original (PRIME) version of INSTRUCT to an IBM PC-compatible microcomputer using the TURBO PASCAL 5.5 programming language. This means that the Slovene version of INSTRUCT consists of the following main modules:

• natural language query input (in Slovene);
• elimination of non-content-bearing terms from the query (using a dictionary of 1,593 Slovene stop-words);
• stemming of the remaining query terms (using the algorithm described in Chapter 4);
• morphological term expansion using a string similarity measure;
• best-match searching (with the possibility of imposing Boolean constraints after the initial search has been carried out);
• relevance feedback searching;
• Boolean searching.

The major alterations to the original version of INSTRUCT were carried out in the language-dependent modules in order to achieve the main goal, i.e., the successful processing of Slovene terms both in queries and in documents. Some other modifications which could reflect the substantial developments in information retrieval research since the design of the original version of INSTRUCT (e.g., development of a WIMP-based interface) were therefore not implemented. Any qualitative evaluation of the Slovene version of INSTRUCT should take these limitations into account.

Creation and processing of the Slovene document collection

It has already been noted that the document collection for the original version of INSTRUCT consisted of 6,004 records from the LISA database. Each record consisted of three main fields, i.e., the title, abstract and accession number. Since this document collection represents an example of a database written in the English language, a primary task in developing a Slovene version of INSTRUCT was to find an adequate Slovene substitute for the LISA database.
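The string-similarity expansion module listed above can be illustrated with a short sketch. The exact coefficient used by INSTRUCT is not reproduced in this section, so a Dice coefficient over character trigrams (a common choice in the conflation literature) is assumed here; the padding character and the cut-off of ten candidates are likewise illustrative, in Python rather than the package's PASCAL.

```python
def trigrams(term, pad="*"):
    """Character trigrams of a term, padded so that the first and last
    letters also contribute distinctive trigrams."""
    t = pad + term + pad
    return {t[i:i + 3] for i in range(len(t) - 2)}

def dice(a, b):
    """Dice similarity coefficient between the trigram sets of two terms."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def expand(query_stem, dictionary, k=10):
    """Return the k dictionary stems most similar to query_stem,
    mirroring the display of ten similarly spelt index terms."""
    ranked = sorted(dictionary, key=lambda s: dice(query_stem, s), reverse=True)
    return ranked[:k]
```

Run against the inverted-file dictionary, a stem such as IZOBRAZ would pull out morphologically related neighbours (PREDIZOBRAZ, VISOKOIZOBRAZ, ...) alongside merely similarly spelt ones, which is exactly the mixed list the user is shown in Section 5.7.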
Since no fewer than 27 bibliographic databases have been created over the last few years by specialized information centres and libraries in Slovenia, as reported by the Research Community of Slovenia (1989), it was expected that one of these databases could serve as a test collection for the design of the Slovene version of INSTRUCT. However, the absence of abstracts from these databases (i.e., documents are described only by basic bibliographic units such as a title, keywords, etc.) could present serious limitations in evaluating the retrieval performance of both the Slovene stemming algorithm and best-match searching. In order to run INSTRUCT against a search file having a larger amount of free text (e.g., abstracts from articles) it was thus decided to build a new test document collection. This document collection consists mainly of articles from two journals, i.e., Knjižnica (217 articles, covering the period 1972-1990) and Informatologia Yugoslavica (287 articles, covering the period 1969-1989). It should be noted that Informatologia Yugoslavica contains articles written in all Yugoslav languages, and consequently, some of the units had to be translated into Slovene before the actual input. The document collection covers the area of librarianship and information science and contains 504 units; each unit is represented by the identification number, title, source, and abstract. The processing of this document collection was carried out in a manner similar to the processing of the LISA document collection, i.e., a series of programs was used to create an inverted file. Some of these programs needed alterations, particularly those related to the processing of the Slovene language (e.g., the program dealing with the exclusion of stop-words from documents and the program responsible for the stemming of the remaining terms in documents).
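The document-processing pipeline just described (exclusion of stop-words, stemming of the remaining terms, creation of an inverted file) can be sketched as follows. The original programs were written in PASCAL; this Python sketch is illustrative only, and the stemmer is passed in as a parameter since the Chapter 4 algorithm is far larger than a toy example.

```python
def build_inverted_file(documents, stopwords, stem):
    """documents: {doc_id: free text}; stopwords: set of non-content terms;
    stem: callable mapping a term to its stem.
    Returns the postings component of an inverted file, {stem: sorted doc ids};
    the dictionary component is simply the sorted key set."""
    postings = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term in stopwords:
                continue                       # exclusion of stop-words
            postings.setdefault(stem(term), set()).add(doc_id)
    return {s: sorted(ids) for s, ids in postings.items()}
```

Applied to the 504-unit collection, a pipeline of this shape yields the dictionary and postings files that best-match searching ranks against; the real programs additionally build the display file holding the formatted records.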
In summary, an inverted file, consisting of a dictionary, postings and display files, also forms the basis of the Slovene version of INSTRUCT. This version was indexed by 2,957 stems.

Modifications to the source code of INSTRUCT

The original version of INSTRUCT was written in modular form using the standard PASCAL programming language to allow both portability and ease of any future modifications or enhancements. The only exceptions are the disk-handling routines, which are inevitably system-dependent. It was thus quite easy to transfer the source code of INSTRUCT to an IBM PC-compatible microcomputer (MS-DOS operating system) using the TURBO PASCAL 5.5 programming language. The following are some modifications needed to accommodate MS-DOS file handling routines and the TURBO PASCAL programming style:

• substitution of PRIMOS disk-handling routines with MS-DOS routines (e.g., routines for opening or closing files);
• division of the large source code of INSTRUCT into smaller units as required by TURBO PASCAL;
• application of STRING variables throughout the whole program.

However, some major modifications of the program were also required, particularly in order to incorporate facilities for handling a large stop-word list and, most importantly, automatic conflation of Slovene terms. As described in Chapter 4, the Slovene stemming algorithm is based on the longest-match principle, using a list of 5,276 endings with associated context-sensitive rules. In addition, three types of recoding rules are applied after suffix deletion. This approach differs very much from Porter's stemming algorithm, which is based on the iteration principle. Thus, considerable alterations were carried out in the modules dealing with automatic word conflation. To summarize, two months' work by the author was required both for building the document collection and for producing the Slovene version of the INSTRUCT package.
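The two-step conflation scheme described above (longest-match suffix removal followed by recoding) can be contrasted with Porter's iterative approach by a minimal sketch. The actual algorithm uses 5,276 endings with per-ending context-sensitive rules and three types of recoding rules; here the context check is reduced to a minimum-stem-length condition, and the example endings and recodings are toy assumptions, not the Chapter 4 rule set.

```python
def longest_match_stem(word, endings, recodings, min_stem=3):
    """Strip the single longest ending whose removal leaves at least
    min_stem characters (a stand-in for the context-sensitive rules),
    then apply at most one recoding rule to the stem tail."""
    for suffix in sorted(endings, key=len, reverse=True):  # longest match first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:len(word) - len(suffix)]
            break                                          # one pass, no iteration
    for old, new in recodings.items():
        if word.endswith(old):
            word = word[:len(word) - len(old)] + new       # recoding step
            break
    return word
```

Unlike Porter's algorithm, which repeatedly cycles through small rule groups, a longest-match stemmer makes a single pass over one large, length-ordered suffix dictionary, which is why the module replacing Porter's code had to be substantially rewritten.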
Despite the fact that the stemming of Slovene words could be potentially demanding of computational resources (since word conflation is carried out in two steps using a large dictionary of suffixes), the initial results are quite encouraging. It seems that both the design and the implementation of the Slovene stemming algorithm within the INSTRUCT package were able to achieve the most important goal, i.e., a balance between the quality of the algorithm and the simplicity and efficiency of processing. In order to demonstrate both the performance of the stemming algorithm and some other retrieval facilities, the next section illustrates the search process carried out using the Slovene version of INSTRUCT.

5.7 An example of best-match searching using the Slovene version of INSTRUCT

This section gives an example of the use of INSTRUCT to carry out a best-match search for the query VISOKOŠOLSKO IZOBRAŽEVANJE IN BIBLIOTEKARSKI KADRI (i.e., UNIVERSITY EDUCATION FOR LIBRARY STAFF). The example has been taken directly from the output displayed on the screen during the execution of the search. The output has been edited to eliminate many blank lines and the repeated display of the list of stems. User inputs are in italics (INSTRUCT commands are written in English to allow the English reader to follow the execution of the search process; for the user in Slovenia they are, of course, written in Slovene).
> INSTRUCT

* * * * WELCOME TO INSTRUCT * * * *
(An INteractive System for Teaching Retrieval Using Computational Techniques)

If you
• are an experienced user of INSTRUCT, type : 1
• would like the opportunity to see some explanatory text at various points throughout the search, type : 2
Please type response now:
> 1

Please input a query ending with a #:
> Visokošolsko izobraževanje in bibliotekarski kadri #

The current form of the search term list is as follows:
1 VISOKOŠOL (18)
2 IZOBRAZ (73)
3 BIBLIOT (66)
4 KADER (24)

To:
• add any terms, type : 1
• delete any stems, type : 2
• let the system find some related terms, type : 3
• perform a BEST-MATCH search, type : 4
• perform a BOOLEAN search, type : 5
• leave INSTRUCT, type : 6
Please type response now
> 3

You now have the opportunity to expand your query
• If you select one of the stems shown below then a group of 10 similarly spelt (or misspelt) terms from the index will be displayed to you.
• You can then add any useful stems to your list of search terms.
To expand on any term type its NUMBER and press RETURN otherwise type # and press RETURN

THE SEARCH IS NOW BEGINNING
it might take some time so please wait

You have chosen to search on IZOBRAZ and the following stems have been retrieved:
1 PREDIZOBRAZ (1)
2 SAMOIZOBRAZ (1)
3 SPLOŠNOIZOBRAZ (36)
4 VISOKOIZOBRAZ (1)
5 OBRAZ (5)
6 PREOBRAZ (4)
7 IZRAZ (6)
8 FRAZ (1)
9 IZKAZ (1)
10 OBRAZLOŽ (1)

To include any of these new terms in your query type in their numbers finishing with a # and press RETURN; e.g., 1 2 3 #. Otherwise just type # and press RETURN
> 4 #

The search term list is now as follows:
1 VISOKOŠOL (18)
2 IZOBRAZ (73)
3 BIBLIOT (66)
4 KADER (24)
5 VISOKOIZOBRAZ (1)

To expand on another term type in its NUMBER and press RETURN. To leave this section (and to return to the main menu) type # and press RETURN.
> #

To:
• add any terms, type : 1
• delete any stems, type : 2
• let the system find some related terms, type : 3
• perform a BEST-MATCH search, type : 4
• perform a BOOLEAN search, type : 5
• leave INSTRUCT, type : 6
Please type response now
> 4

BEST-MATCH SEARCH
THE SEARCH IS NOW BEGINNING
it might take some time so please wait

The search has retrieved a total of 142 documents having at least one term in common with the query.

Do you want to limit results with any Boolean constraints? (Y/N)
> N

How many documents do you want to see? If you just press RETURN five documents will be displayed
> <RETURN>

How much of the data do you want to see:
• Title only type : 1
• Title and Abstract type : 2
> 1

If at any stage you decide that you don't want to see any more documents just type #.

1/163 Kadri in znanje: Primer splošnoizobraževalnih knjižnic v Ljubljani * Kolenc, J.: Knjižnica, 32(1988)1/2, str. 23-45
Do you consider this document relevant to your query? (Y/N)
Or if you need to see the abstract as well, type 'A'
> A

1/163 Kadri in znanje: Primer splošnoizobraževalnih knjižnic v Ljubljani * Kolenc, J.: Knjižnica, 32(1988)1/2, str. 23-45
Poleg ustreznih prostorov in sodobne opreme so kakovostni kadri (strokovni delavci) bistveni pogoj za učinkovito izvajanje knjižnično-informacijske dejavnosti. Tudi KIS v SRS in SFRJ namenjata temu vprašanju poseben pomen. Kadrovska funkcija v splošnoizobraževalnih knjižnicah pa ni le potentia temveč tudi agentia (agens) vsakršnega razvoja. Zahtevani prehod v informacijsko družbo postavlja tudi pred nas imperativ: profesionalizacija! Znanje se v knjižničarstvu lahko pretaka le skozi visokoizobražen in kvalificiran kader. Razvoj višjih oblik delitve dela je pogoj za dvig družbenega ugleda (statusa) knjižničarskega poklica.
Do you consider this document relevant to your query? (Y/N)
> Y

2/155 Izobraževanje knjižničarskih delavcev v Jugoslaviji * Berčič, B.: Knjižnica, 31(1987)1, str.
110-129
Do you consider this document relevant to your query? (Y/N)
Or if you need to see the abstract as well, type 'A'
> Y

3/185 Začetek visokošolskega študija bibliotekarstva v Sloveniji * Berčič, B.: Strokovni posvet ZBDS, Bled 1987, 13 str.
Do you consider this document relevant to your query? (Y/N)
Or if you need to see the abstract as well, type 'A'
> Y

4/79 Kako je pri nas z razvojem bibliotekarske vede? * Sepe, M.: Knjižnica, 26(1982)1/2, str. 76-80
Do you consider this document relevant to your query? (Y/N)
Or if you need to see the abstract as well, type 'A'
> A

4/79 Kako je pri nas z razvojem bibliotekarske vede? * Sepe, M.: Knjižnica, 26(1982)1/2, str. 76-80
Razvoj knjižnične stroke naj bi bil preplet treh dejavnikov: Narodne in univerzitetne knjižnice, ki s svojo matično službo kot organizatorjem dejavnosti, z raziskovalnim centrom kot središčem raziskovalnega dela in ob sodelovanju strokovnjakov z drugih specialnih področij bibliotekarske stroke predstavlja nosilca razvoja bibliotekarstva (to NUK-u nalaga tudi Zakon o knjižničarstvu). Skrb za preučevanje in razvoj posameznih strokovnih področij naj še naprej ostane pri Društvu bibliotekarjev. Tretji dejavnik, ki prispeva k razvoju stroke, pa naj bi bile tudi knjižnice same, predvsem s strokovnim izobraževanjem svojih kadrov.
Do you consider this document relevant to your query? (Y/N)
> Y

5/204 Trženje v visokošolski knjižnici * Češnovar, N.: Knjižnica, 33(1989)3/4, str. 123-128
Do you consider this document relevant to your query? (Y/N)
Or if you need to see the abstract as well, type 'A'
> A

5/204 Trženje v visokošolski knjižnici * Češnovar, N.: Knjižnica, 33(1989)3/4, str. 123-128
Marketing oziroma trženje je proces upravljanja, ki ga enačimo s pričakovanji, zahtevami in zadovoljitvijo strank, pri tem pa imamo dobiček.
Knjižnica, ki se tržno obnaša, mora: zagotoviti hiter dostop do svojega gradiva; izvajati medknjižnično izposojo; kreirati informacijske zbirke, ki še ne obstajajo; slediti mora modernemu razvoju informacijskih služb. Da lahko sledi tem zahtevam, morajo biti kadri v visokošolskih knjižnicah primerno usposobljeni, da lahko ugotovijo in strokovno presodijo, kaj uporabnik potrebuje. Na tej stopnji pa informacijsko delo že začne preraščati v raziskovalno dejavnost.
Do you consider this document relevant to your query? (Y/N)
> N

At this point, if insufficient numbers of useful documents have been obtained, the user can inspect more documents, do a relevance feedback search based upon the relevance judgements given to the system, or carry out a Boolean search.

5.8 Conclusions

In this chapter we have described INSTRUCT, an interactive computer program which demonstrates some of the techniques which have been suggested for implementing on-line bibliographic retrieval systems. The incorporation of a stemming algorithm into a Slovene version of INSTRUCT represents an important step towards the introduction of non-conventional searching techniques into Slovene information retrieval systems. However, INSTRUCT cannot be introduced into the Slovene information retrieval environment without obtaining extensive evaluation results of its retrieval performance. A description of the experimental test which was used to assess its performance is therefore presented in the next chapter.

Chapter 6

Evaluation of the Stemming Algorithm for Slovene IR — Experimental Environment

6.1 Introduction

Conclusions from previous chapters, mainly Chapter 4 and Chapter 5, can serve as a starting point for designing and conducting experimental tests on the retrieval performance of automatic word conflation in Slovene IR systems.
These conclusions have indicated that statistically-based techniques for information retrieval could also be applied to the process of searching documents in Slovene language databases. However, in order to prove this assumption, an experimental test has to be carried out. Although there are some exceptions—see, for example, the article on using non-conventional retrieval techniques in German language-based IR systems (Fuhr, 1990), the article on the Finnish stemming algorithm (Jappinen et al., 1985), and the article on IR experiments based on a French set of queries and documents (Chiaramella and Defude, 1987)—experiments on the performance of statistically-based techniques have so far mainly been carried out on English test collections (see, for example, Lennon et al., 1981; Harman, 1987). Therefore, the main problem to be investigated is contained in the following question: are non-conventional, statistically-based techniques of document retrieval applicable also to the Slovene language? This question is of particular importance because of a requirement for a multi-lingual approach to document retrieval in Slovenia.

In order to test the correlation between non-conventional, statistically-based techniques for document retrieval and the characteristics of the Slovene language, two experiments were carried out. The first experiment, referred to subsequently as Experiment I, was concerned with testing the retrieval performance of automatic word conflation in Slovene. Automatic stemming was compared with two other types of text representation, i.e., manual right-hand truncation, carried out by a trained intermediary, and non-stemming. The results obtained from Experiment I are described in Chapter 7. The second experiment, referred to subsequently as Experiment II and described in Chapter 8, deals with the multi-lingual approach to statistically-based IR methods and techniques.
On the basis of performance results from two document collections, the first written in Slovene and the second an English translation of the Slovene texts, an experiment was carried out to test whether statistically-based methods of information retrieval can be successfully applied to two different languages.

In this chapter, the following main components of the test environment for Experiment I will be described. First, the question of whether a laboratory or an operational test should be employed will be considered. The following section on test collections will outline their three components: a document collection, a set of queries, and relevance assessments. Then, the implementation of three different text representation modules within the INSTRUCT package will be discussed. The final section will describe the methods which were used for the analysis of the collected data.

6.2 Laboratory versus operational tests

As pointed out above, the main objective of Experiment I was to test whether automatic word conflation can be implemented in Slovene IR systems without any average loss of retrieval performance, thus allowing users easier access to the systems. This objective relates directly to the most difficult experimental dilemma in IR: what kind of test should be carried out? Should it be a laboratory or an operational test? Cleverdon (1966) made this distinction clear in the Cranfield 2 experiments. Operational tests normally involve an evaluation of an existing system; laboratory tests attempt to advance knowledge about individual variables of information retrieval. The idea of conducting experiments under laboratory conditions is to control all variables as far as possible (Robertson, 1981) and, according to Bawden (1990), experiments of this kind are carried out only in order to test some hypothesis which the experiment will prove or disprove.
Although laboratory-style evaluations of IR systems have recently come under increasing criticism, particularly from the advocates of user-oriented evaluations (UOE)—see, for example, Ellis (1990), Bawden (1990)—the following reasons dictated the application of a laboratory-type evaluation in Experiment I:

• the experiment was not concerned with the performance of the complete IR system (i.e., INSTRUCT), but only with one of its sub-systems, i.e., the retrieval performance of automatic word conflation using a single searching strategy;
• the only variable under investigation was the stemming algorithm and its influence on retrieval performance;
• the experiment was based on the simple comparison of automatic word conflation with manual right-hand truncation and non-conflation;
• no user-oriented variables (e.g., human factors) were taken into account as a part of this experiment.

Manual right-hand truncation and automatic stemming have the same purpose, i.e., to improve retrieval performance, in particular its recall. The Slovene stemming algorithm can therefore be evaluated by comparing its performance results with the results obtained by the application of manual right-hand truncation. In other words, a performance at least equivalent to that of manual right-hand truncation means that stemming can be automated.

Experiment I was accordingly concerned with three different types of text representation (automatic stemming, manual right-hand truncation and non-conflation) and their influence on retrieval performance, expressed in terms of recall and precision. In order to carry out the experimental testing, the following procedures were required:

1. design and development of a test collection;
2. implementation of three different modules of query and document processing within INSTRUCT;
3. searching of documents in a database;
4. evaluation of results.

As noted above, this chapter will be concerned mainly with the first two items.
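Recall and precision, the two measures named above, can be computed for a single query as in this minimal sketch (set-based, over the whole retrieved list; cut-off-based variants follow the same pattern and are an assumption of later chapters, not shown here).

```python
def recall_precision(retrieved, relevant):
    """retrieved: ranked list of document ids returned for a query;
    relevant: set of document ids judged relevant to that query.
    Returns (recall, precision) over the full retrieved list."""
    hits = sum(1 for d in retrieved if d in relevant)
    recall = hits / len(relevant) if relevant else 0.0     # coverage of relevant set
    precision = hits / len(retrieved) if retrieved else 0.0  # purity of output
    return recall, precision
```

Averaging these figures over the 48 test queries, for each of the three text representations, yields the per-representation performance comparison that Experiment I is built around.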
6.3 Test collection

It has been widely accepted in laboratory-type IR experiments that test collections involve three aspects:

• a collection of documents;
• a set of search requests, i.e., queries;
• relevance judgments relating the search requests to the documents.

The literature on IR experiments reveals that most of the existing test collections were developed for searching documents written in the English language. A comprehensive list of these test collections can be found in Sparck Jones and Bates (1977). Despite many sorts of criticism of these test collections—in particular on the problem of relevance assessments (see, for example, Ellis, 1990; Bawden, 1990)—the existing test collections have been demonstrated to be very useful tools for various IR experiments. For example, Harman (1987) tested different stemming algorithms using three test collections.

Since Experiment I was defined to be a laboratory-type evaluation, an appropriate Slovene test collection, consisting of Slovene documents, queries and relevance judgments, was required. However, the lack of any experimental work in information retrieval and the deficit in evaluating existing IR systems in Slovenia mean that the test collection designed as part of Experiment I will be the first such collection in Slovenia. A description of the components of this test collection is presented below.

6.3.1 Documents

According to Robertson (1981), a document set is, in theory at least, taken as more or less synonymous with text in linguistics, i.e., it describes any piece of linguistic material that can reasonably be considered as a unit (e.g., a scientific paper). For the purpose of evaluating a stemming algorithm it is of particular importance to have a set of documents consisting of terms in natural language, i.e., illustrating the morphology of a particular language. Thus, apart from a title, at least an abstract should form a unit in a document collection.
The design and development of the Slovene document collection was described in the previous chapter. The following is an example of a unit in this document collection.

Knjižnično-informacijska dejavnost v Sheffieldu, Velika Britanija: vtisi s strokovnega izpopolnjevanja * Popovič, B.: Knjižnica, 34(1990)1/2, str. 85-100
Prispevek opisuje delovanje in povezovanje knjižnic in informacijskih centrov v Sheffieldu, Velika Britanija. Po opisu nekaterih najzanimivejših knjižnic in informacijskih centrov v Sheffieldu se dotakne predvsem naslednjih tematskih sklopov: obdelave gradiva in organizacije dela v knjižnicah, avtomatizacije, povezanosti knjižnic in informacijskih centrov z družbenim okoljem ter povezovanje knjižnic in informacijskih centrov v regionalnem okviru. Nekaj besed je tudi o reševanju finančne problematike v knjižnicah, kar bo tudi za nas verjetno vedno bolj aktualno.

The size of the test collection in terms of the number of documents (i.e., 504 units, which represent all available articles in the two journals) is relatively small. However, the nature of the experiment, i.e., a simple comparison of three different types of text representation, and the limited number of variables to be controlled, were sufficient arguments for the decision about the size, form, and subject coverage of the document collection.

6.3.2 A set of queries

According to Tague (1981), a query is the verbalized statement of a user's need; it is often expressed as a short natural language question or statement and may be accompanied by terms chosen manually from an indexing language. So far, great variability has been exhibited by IR experiments in their methods of obtaining queries. The following are some of the issues that must be considered in the process of obtaining test queries:

• The searcher must be correctly aware of the user's requirements, and, therefore, the query must be properly defined and covered (Salton, 1975).
Tague (1981) adds that any unclear queries should be rejected;
• There is much variation shown by IR experiments in their method of obtaining requests; one may either solicit the co-operation of the actual users of a system or use queries which are in some sense artificial but under greater control of the investigator;
• Ideally, users should be randomly selected from a pool by the investigator, but this is rarely possible as users are normally self-selected because of the degree of co-operation required of them.

These issues about designing a set of queries in IR experiments served as a starting point in the query development for this experiment.

Query development

In this experiment, eight (8) different persons were used to generate a total of forty-eight (48) different search requests in the librarianship and information science field. This is a reasonable number to allow for at least some assurance that the results obtained are not simple artifacts of an inadequate query set (for more information on sample sizes see Robertson, 1990). A list of all of the queries can be found in Appendix B. The persons were selected on the basis of research specialization, availability, and willingness to judge the relevance of the documents retrieved. Each person was familiar with the library and information science field, being either a librarian or a researcher in library science. Each person was asked to produce six requests that were either of his own interest or might be asked by library and information science researchers. To aid in the query generation, a detailed and carefully drawn set of instructions was distributed to the group of query authors. The main criteria proposed for the query formulation are similar to those designed by Lesk and Salton (1969) and are presented in Table 6.1. Table 6.1 shows that each query was expected to represent a real information need, and had to be expressed in grammatically correct and, hopefully, unambiguous Slovene.
Positive formulations were required and the queries were to be generated independently of the document collection; in particular, no source document was to be used for the formulation of any of the queries. In addition, no limits were defined as to the number of words or sentences in each query. Despite detailed instructions on how each query should be formulated, the nature of the document collection, in particular its small size, required additional interviews with each of the requesters. These interviews were needed for the following reasons:

• some of the queries covered very specific topics and their processing could result in retrieving a very small number of relevant documents;
• some of the requesters provided similar queries, i.e., covering the same topic;
• while some queries were enhanced with synonyms and abbreviations, other queries were short, simple statements.

Positive criteria for query formulation
1. Generate queries of real interest to a potential researcher
2. Formulate queries in clear, coherent and grammatically correct sentences
3. Use positive formulations stating what subject areas are actually wanted
4. Use homogeneous query formulations representing a single topic
5. Use only common abbreviations

Negative criteria for query formulation
1. Avoid "exotic" topics and doubtful subject matters
2. Do not submit queries corresponding to the contents of a specific document
3. Avoid negative formulations, introduced by "except", "not", "other than", etc.

Table 6.1: Principal criteria for query formulation

Thus, the additional interviews helped towards a formulation of search topics which were clear and meaningful. Since some of the original queries were altered (for example, to cover a broader subject area or to include additional keywords, phrases or common abbreviations), the existing set of 48 queries can therefore be defined as a combination of real and structured queries.
Table 6.2 shows the main characteristics of this set of queries before and after stopwording has been carried out. Before deletion of stop-words, the largest query consisted of 19 terms and the smallest query of only 2 terms. The median number of terms per query was 8. Since queries were formulated mainly as natural-language sentences, it is interesting to see their quantitative characteristics after the removal of stop-words. The total number of words was reduced from 370 to 293, and the median number of terms per query was 6.

Quantitative characteristics                    Before stopwording   After stopwording
Number of queries                                       48                  48
Total number of terms in a set of queries              370                 293
Average number (median) of terms per query               8                   6
Maximum number of terms in a query                      19                  15
Minimum number of terms in a query                       2                   2

Table 6.2: The main characteristics of the set of queries before and after deletion of stop-words

In order to carry out a comparative evaluation of automatic stemming and manual right-hand truncation, all queries from the test set were also processed by a trained intermediary. The every-day job responsibility of this person is to search for information in various English and Slovene databases. Since all Slovene document databases can be described as descriptor-based collections, the processing of natural-language queries in Slovene was a new and demanding linguistic experience for the intermediary. Different tools (linguistic handbooks, dictionaries) were therefore used as an aid during the manual processing of queries. Using knowledge of Slovene linguistics, the professional intermediary thus truncated all terms (apart from stop-words) manually on the right, with the objective of achieving good retrieval results.
For example, if the query consisted of the following statement:

Karkoli o Narodni in univerzitetni knjižnici (NUK) v Ljubljani
(Anything about the National and University Library in Ljubljana)

the professional intermediary first excluded stop-words (i.e., KARKOLI, O, IN, V) and then formulated the query using right-hand truncation. The result of his manual involvement was a list of truncated terms, as shown in the example below:

Narodn? univerz? knjižnic? NUK? Ljubljan?

A list of all queries manually reformulated by the professional intermediary can be found in Appendix C.

6.3.3 Relevance assessments

Most commonly, documents output by the system in IR experiments are individually assessed for relevance to the user's need. The word relevance, as pointed out by Robertson (1981), has been used in many different ways, but broadly corresponds to the question: how well does the document match the user's need? There is a wealth of literature relating both to relevance in general and to obtaining relevance assessments when setting up a test collection in particular. First of all, there is no doubt that relevance is a subjective notion, i.e., different users may differ about the relevance or non-relevance of particular documents to a given question. However, van Rijsbergen (1979) suggests that the difference is not large enough to invalidate experiments which have been made with test collections for which requests with corresponding relevance assessments are available. One major study has investigated the factors that influence relevance decisions (Cuadra and Katter, 1967). This shows that relevance judgments are influenced by many factors (e.g., the skills and attitudes of the judges, the documents used). Thus, relevance is an imprecise concept, always dependent on the precise situation of the user (Bawden, 1990).
However, since the performance measures in IR experiments rely on being able to distinguish useful and required items from those which are not so useful, some type of relevance assessment is required. Lesk and Salton (1969) found that inconsistency of relevance assessments may have no effect on certain aspects of system evaluation. In fact, Robertson (1981) suggests that assessment of relevance allows a harder form of analysis than any other assessments in this category. Bearing in mind the vague definition of relevance and its subjective nature, there are many practical problems in obtaining relevance judgments. Tague (1981) points out that actually getting assessments of document relevance is an even greater problem than getting queries. Robertson (1981) listed the following aspects of the process of obtaining relevance assessments:

1. Who is to make the assessments? Robertson (1981) states that where the request is stimulated by a genuine information need, the requester should decide on relevance. This may cause problems, since the user may not be prepared to assess as many documents as desired by the experimenter. Many experiments rely on third parties for assessments, though this is regarded with distrust.

2. How much of the document should the relevance judge see before making a judgment? Ideally, the entire text of the document should be produced, but usually titles and abstracts are used. Titles alone are very poor indicators.

3. Which documents should be judged? Ideally, the whole document collection should be assessed, but as this would usually be impossible (because of the size of the document collection under consideration), the set to be judged will often consist of the pooled output of various searches.

4. What instructions should be given to the judges, and in what form should the assessments be obtained? It is important that all individuals making relevance assessments receive the same instructions.
It has frequently been pointed out that relevance embodies two distinct notions, i.e.: (a) is the document about the subject of the query? (the aboutness criterion); (b) will the document be useful to the user? (the usefulness criterion). Users should, therefore, be clear whether they are assessing subject relevance or pertinence. In the former usage a relevant document is simply one which deals (to a greater or lesser extent) with the same subject matter as that of the query, whereas for a document to be pertinent it has to contain information which is new and useful to the originator of the query in the subject area of the query. In addition, more than two categories of relevance assessment (i.e., relevant/non-relevant) are often provided to the users, although there are no experimental performance measures that consider anything other than relevant/non-relevant. These additional degrees can be used only in the qualitative type of evaluation; at the analysis stage, they are conflated into just two categories. However, having more than two degrees of relevance makes it easier for a judge to give a relevance value to documents in a collection.

The practical problems outlined above were considered in detail when requesters were asked to judge documents for relevance as part of Experiment I. The relevance assessments were carried out in the following manner:

1. Relevance assessments of a set of retrieved documents were made by the requester, i.e., the user; this means that no third party (for example, a team of experts in librarianship) was involved in this part of the experiment.

2. Requesters judged documents for their relevance on the basis of information obtained from the title and abstract of each retrieved document.

3. The set of documents to be judged for relevance consisted of the pooled output of three different types of search (i.e., automatic stemming, right-hand truncation, non-conflation), using the ranked-output cutoff procedure.
The cutoff point was 10, i.e., only the first ten retrieved documents from each type of search were considered further.

4. All relevance judges were given the same instructions about carrying out relevance assessments. In the experiment reported here, relevance was used with the meaning of aboutness, and not usefulness. A document was therefore judged to be relevant only "... if it is directly stated in the abstract as printed, or can be directly deduced from the printed abstract, that the document contains information on the topic asked for in the query" (Lesk and Salton, 1969). Consequently, some documents were still judged to be relevant, although: (a) the requester had already read these documents; (b) the requester does not usually read documents written by a particular author; (c) the document is obsolete, etc.

5. Since this experiment was interested mainly in quantitative evaluation, the requesters were instructed to supply only binary (yes or no) relevance assessments.

To summarize, the experiment reported here used the results of previous projects concerned with obtaining relevance assessments as part of a test collection. The noticeable deviation from the "ideal" path can be found in two questions, i.e., which documents should be judged, and how many degrees of relevance assessment should be obtained from the users? The decision to use only a retrieved pool of documents, and to apply only yes/no criteria, was based on the nature of this testing. The aim of the experiment was not to evaluate the whole system, but to find basic performance differences between three types of text representation. Since the project did not contain many variables to be controlled, the simplicity of its design was of crucial importance. Having obtained relevance assessments on the retrieved pool of documents, it became possible to measure the retrieval performance of the three different types of search.
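The pooled output described in point 3 above (the union of the top 10 documents from each of the three searches) can be sketched as follows. This is only an illustration of the pooling procedure, not the actual experimental software, and the ranked lists are invented placeholders rather than data from the experiment.

```python
def pooled_output(ranked_lists, cutoff=10):
    """Union of the top-`cutoff` documents from each type of search.
    This pool is what the requester judges for relevance."""
    pool = set()
    for ranking in ranked_lists:
        pool.update(ranking[:cutoff])
    return pool

# Invented example: three rankings for one query (stemming,
# truncation, non-conflation), each already sorted by decreasing
# similarity to the query.
stemmed   = [17, 24, 30, 66, 77, 116, 219, 220, 225, 234, 237]
truncated = [17, 24, 30, 66, 77, 116, 219, 220, 225, 331, 358]
unstemmed = [30, 225, 390, 315, 324]

pool = pooled_output([stemmed, truncated, unstemmed])
print(len(pool))   # 14 distinct documents to be judged
```

Because the three searches overlap heavily, the pool is usually much smaller than three times the cutoff.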
However, in order to carry out such an evaluation, three different modules of query and document processing had to be implemented within an information retrieval system. Since INSTRUCT was selected as a test bed for this project, the incorporation of these modules in INSTRUCT is described in the next section.

6.4 Text representation modules in INSTRUCT

In order to carry out the test, the following three different modules were developed within the INSTRUCT package:

• automatic stemming,
• manual right-hand truncation,
• non-conflation.

The main characteristics of each module and its implementation within the INSTRUCT package are described in the sections below.

6.4.1 Automatic stemming

After stop-words are removed from both a query and the documents in a database, suffix removal is carried out using the stemming algorithm described in Chapter 4. Then, a best-match search is carried out, producing a ranking of documents. The best-match searching algorithm and weighting techniques implemented in INSTRUCT were described in detail in Chapters 1 and 5. Suppose we want to search for documents containing the word ARHIVI. The inverted file of INSTRUCT consists of data as illustrated in Table 6.3.

Dictionary file    Postings file
 2 ARAB            154, 166
 2 ARGUM           242, 361
 3 ARHITEK         128, 381, 390
19 ARHIV           17, 24, 30, 66, 77, 116, 219, 220, 225, 234, 237, 249, 296, 306, 315, 324, 331, 358, 389
 1 ARIS            166
 1 ARISTOK         387
 1 ARISTOT         416
 1 ARMAR           124
 1 ARTIKUL         433
 1 ARTOT           162

Table 6.3: An excerpt from the inverted file of INSTRUCT (stemmed version)

It can be seen from Table 6.3 that the word ARHIVI must be truncated to ARHIV*, in order to retrieve its morphological variants. The actual searching then results in a list of 19 document identifiers.
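The lookup illustrated in Table 6.3 can be sketched as follows, assuming a plain Python dictionary as the inverted-file structure. The data are taken from the table, but the function name is illustrative only and not part of the actual INSTRUCT implementation.

```python
# A minimal sketch of the stemmed inverted file from Table 6.3:
# each dictionary entry maps a stem to its postings list of
# document identifiers.  (Illustrative structure, not INSTRUCT code.)
inverted_file = {
    "ARAB":    [154, 166],
    "ARGUM":   [242, 361],
    "ARHITEK": [128, 381, 390],
    "ARHIV":   [17, 24, 30, 66, 77, 116, 219, 220, 225, 234, 237,
                249, 296, 306, 315, 324, 331, 358, 389],
    "ARIS":    [166],
}

def lookup(stem):
    """Return the postings list for a stemmed query term."""
    return inverted_file.get(stem, [])

# The query word ARHIVI, once reduced to the stem ARHIV,
# retrieves 19 document identifiers.
postings = lookup("ARHIV")
print(len(postings))   # 19
```

Note that in the stemmed version a single dictionary entry already covers all morphological variants, so a single lookup suffices.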
Or, in other words, using the main part of the term weighting formula, i.e.,

log(N/ni) + log(9.0)

where:
N - the number of documents in the collection,
ni - the number of documents indexed by the i-th term,
9.0 - the value of the constant C, as defined by Croft and Harper (1979) and also explained in Chapter 1,

the elements of the array denoting the similarity between the query and each of the documents are incremented by

log(N/19) + log(9.0)

6.4.2 Non-conflation

This module is in fact responsible only for the removal of stop-words from both documents and queries. After stop-words are deleted, the remaining words remain intact. This means that the part of the inverted file of INSTRUCT concerned with the word ARHIVI is as shown in Table 6.4. The first conclusion which can be drawn from Table 6.4 is that the dictionary file of the unstemmed version of INSTRUCT consists of a larger number of terms than the dictionary file in the stemmed version. It can be seen that there are in fact 15 variants of the stem ARHIV*. Since the non-conflated representation of text means that words from queries and documents remain unchanged, searching for documents containing the term ARHIVI results in only 2 document identifiers.
The elements of the array within the term weighting module are therefore incremented by

log(N/2) + log(9.0)

6.4.3 Manual right-hand truncation

The module performing manual right-hand truncation within INSTRUCT exhibits the following features:

• the same inverted file is used as for the unstemmed version of INSTRUCT (see Table 6.4);
• words from the query are manually processed by a trained intermediary in two stages:
  - removal of stop-words from the query,
  - suffix removal from the remaining words.

Dictionary file      Postings file
1 ARHITEKTURE        390
4 ARHIV              116, 296, 358, 389
1 ARHIVA             315
2 ARHIVI             30, 225
1 ARHIVIRANJU        324
1 ARHIVISTI          225
2 ARHIVISTIKA        234, 249
1 ARHIVISTIKE        331
2 ARHIVISTIKO        77, 331
1 ARHIVOM            237
6 ARHIVOV            219, 220, 225, 296, 306, 331
1 ARHIVSKE           331
2 ARHIVSKEGA         225, 331
5 ARHIVSKIH          17, 24, 66, 225, 331
1 ARHIVSKIMI         296
1 ARHIVU             358
1 ARIS               166
1 ARISTOKRACIJE      387
1 ARISTOTEL          416
1 ARMARIJEV          124

Table 6.4: An excerpt from the inverted file of INSTRUCT (unstemmed version)

Thus, if a trained intermediary truncates the search term ARHIVI to ARHIV*, the dictionary file is searched to find the various lists identified by the string ARHIV. Table 6.4 indicates that there are 15 such lists. These lists are then OR-ed together, and x (where x = number of distinct identifiers) is counted. Finally, the elements of the array within the term weighting module are incremented by using the following formula,

log(N/x) + log(9.0)

or, in this experiment, where x = 19,

log(N/19) + log(9.0)

Thus, in this particular case, both stemming and manual right-hand truncation would have precisely the same effect. To summarize, there are three main differences between the three types of text representation in INSTRUCT: the size of the dictionary file; the role of an intermediary; and the number of records retrieved.
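The expansion of a manually truncated term against the unstemmed dictionary of Table 6.4 can be sketched as follows. The prefix-matching helper is a hypothetical illustration rather than the INSTRUCT code itself; N = 504 is taken from the size of the test collection and C = 9.0 from Croft and Harper (1979).

```python
import math

# Unstemmed dictionary excerpt from Table 6.4 (term -> postings).
# Illustrative data structure; not the INSTRUCT source code.
dictionary = {
    "ARHITEKTURE": [390],
    "ARHIV": [116, 296, 358, 389],
    "ARHIVA": [315],
    "ARHIVI": [30, 225],
    "ARHIVIRANJU": [324],
    "ARHIVISTI": [225],
    "ARHIVISTIKA": [234, 249],
    "ARHIVISTIKE": [331],
    "ARHIVISTIKO": [77, 331],
    "ARHIVOM": [237],
    "ARHIVOV": [219, 220, 225, 296, 306, 331],
    "ARHIVSKE": [331],
    "ARHIVSKEGA": [225, 331],
    "ARHIVSKIH": [17, 24, 66, 225, 331],
    "ARHIVSKIMI": [296],
    "ARHIVU": [358],
    "ARIS": [166],
}

def expand_truncated(prefix, dictionary):
    """OR together the postings of every term starting with the prefix."""
    docs = set()
    for term, postings in dictionary.items():
        if term.startswith(prefix):
            docs.update(postings)
    return docs

N = 504   # documents in the test collection
C = 9.0   # Croft-Harper constant, as in Chapter 1

docs = expand_truncated("ARHIV", dictionary)
x = len(docs)                         # number of distinct identifiers
weight = math.log(N / x) + math.log(C)
print(x)                              # 19, as for automatic stemming
```

OR-ing the 15 ARHIV* postings lists yields the same 19 distinct identifiers as the stemmed dictionary entry, which is why, for this term, stemming and manual truncation would have precisely the same effect.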
The first difference relates to the size of the dictionary file, as illustrated in Table 6.5, which demonstrates that the employment of the Slovene stemming algorithm, by eliminating morphological variants of terms, results in a large compression of the dictionary file. The percentage of this compression (65.6%) is in accordance with the complex Slovene morphology and is even higher than that obtained in tests on a Slovene text corpus, as described in Chapter 4. In other words, the notion about the effect of the size of the database on the level of compression has again been confirmed. While the application of the stemming algorithm to the text corpus consisting of 2,616 distinct word types resulted in a 54.7% level of compression, its employment on the larger dictionary file (i.e., 8,602 distinct word types) produced a 65.6% level of compression.

Text representation     Number of terms in a dictionary file
Automatic stemming      2,957
Truncation              8,602
Non-conflation          8,602

Table 6.5: Frequency characteristics of dictionary files in the three different versions of INSTRUCT

The next difference can be found within the query input module. While the stemmed and unstemmed versions of INSTRUCT allow natural-language input, manual right-hand truncation requires the active role of a trained intermediary, who has to conflate words at the right point before inputting a query. However, the largest difference between these three modules of text representation relates to the processing of a query within best-match searching. While automatic stemming and manual right-hand truncation allow retrieval of morphologically related terms, the non-conflated module can retrieve only those documents which exactly match words from the query.
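The 65.6% compression figure can be checked directly from the dictionary sizes in Table 6.5:

```python
# Dictionary sizes from Table 6.5.
unstemmed_terms = 8602   # truncation / non-conflation dictionary
stemmed_terms = 2957     # automatic stemming dictionary

compression = (unstemmed_terms - stemmed_terms) / unstemmed_terms * 100
print(round(compression, 1))   # 65.6
```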
For example, searching for documents containing the word ARHIVI in the non-stemmed version of INSTRUCT will result in a list of only two documents; both of the other modules (i.e., automatic conflation and manual right-hand truncation) are able to retrieve a much larger number of relevant documents. There is no doubt that the differences outlined above between the three modules of text representation will have a large effect on the retrieval performance in this experiment. In order to obtain comparable performance results from the three different types of text representation within INSTRUCT, a single search strategy, i.e., best-match searching, has been used in this experiment. Best-match or nearest neighbour searching is described in detail in Chapter 1. In addition, a description of its implementation within INSTRUCT can be found in Chapter 5 (see also Hendry et al., 1986a,b).

6.5 Methods for the analysis of data

Most information retrieval tests are ultimately concerned with the effectiveness of each system. In essence, the question of measuring effectiveness is simple: we want to decide how well the information system is operating, compared with some theoretical maximum (Bawden, 1990). In an operational situation, this means how well it meets the real needs of its users. In an experimental setting, some more or less artificial measures must be adopted as a surrogate for user satisfaction. In a laboratory setting, the measures of performance used in the majority of retrieval tests are the well-known factors of recall and precision.
These two measures of retrieval effectiveness have been at the forefront of retrieval evaluation since Cranfield; they are defined using the following formulae:

RECALL = number of relevant documents retrieved / total number of relevant documents in the collection

PRECISION = number of relevant documents retrieved / total number of documents retrieved

Recall is therefore a measure of how well the system performs at yielding up all the relevant items within it, and precision measures how well the system provides only the relevant items. One of the main achievements of the Cranfield tests was the experimental demonstration that recall and precision are inversely related (Bawden, 1990). These performance measures, of course, depend on the relevance assessments of documents. We have already discussed the impreciseness of the concept of relevance. This concept also assumes that all of the relevant items are equally useful. However, though their simplicity may be criticized, recall and precision are undoubtedly significant for practical evaluation. As Cleverdon (1966) states: "The unarguable fact however is that they are fundamental requirements of the users, and it is quite unrealistic to try to measure how effectively a system or subsystem is operating without bringing in recall and precision".

Since this experiment was concerned with a simple comparison of three different types of text representation in queries and documents, recall and precision were found to be appropriate measures of retrieval performance. Therefore, to answer the question of which type of text representation performs best, the results of searching were analyzed in the following stages (the term "system" here corresponds to a particular type of text representation within INSTRUCT):

1. for each query and system, recall and precision were used as measures of the system's response to the request;
2. for each system, these measures were averaged over the query set;
3. the averages for the different systems were compared;
4. the statistical significance of the differences between the systems was tested using the sign test and the Kendall coefficient of concordance, W.

It has to be noted at this point that relative recall was calculated, based on a substitute list of the total relevant items in the collection, i.e., the pool of retrieved documents. The true value of recall can only be calculated if someone examines each and every item in the collection, to see if it is relevant. This was not the case in Experiment I, in which a pool of retrieved documents was used as a substitute for the complete document collection. Perhaps it should also be pointed out that the precision values were rather superfluous since, with a ranked-output cutoff procedure, all the available information is contained in the recall figure: at a particular cutoff point, if the recall figure for one system is better than that for another, then the precision figure is similarly so.

Although Experiment I was concerned mainly with a quantitative analysis of data, in particular using the measures of recall and precision, some other trends in IR experiments were also taken into account. It is interesting to note that recent literature (see, for example, Bawden, 1990) emphasizes the development of a small set of queries which can then be used for a detailed qualitative analysis. An example of such an approach is a method known as failure analysis, which dates back to Lancaster's early MEDLARS experiments (Lancaster, 1969) and was afterwards employed in many information retrieval tests (see, for example, McCain et al., 1987; Harman, 1991). Its main purpose is to find out why things work as they do, and how matters may be improved. Sparck Jones (1981) notes that failure analysis "...is not part of an experiment proper, but makes a very important contribution to the broader study of retrieval system behaviour."
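The recall and precision measures used in these stages can be sketched as follows. Note that Experiment I computes relative recall against the relevant documents in the retrieved pool rather than the whole collection; the function names are illustrative.

```python
def recall(relevant_retrieved, total_relevant):
    """Recall as a percentage: the proportion of the relevant documents
    that were retrieved.  In Experiment I, total_relevant is the number
    of relevant documents in the pooled output (relative recall)."""
    return relevant_retrieved / total_relevant * 100

def precision(relevant_retrieved, total_retrieved):
    """Precision as a percentage: the proportion of the retrieved
    documents that are relevant."""
    return relevant_retrieved / total_retrieved * 100

# E.g., a query whose pool contains 6 relevant documents, all of which
# appear in the top 10 retrieved by one system:
print(round(recall(6, 6)))      # 100
print(round(precision(6, 10)))  # 60
```

With a fixed cutoff of 10, precision is simply the number of relevant documents in the top 10 multiplied by 10, which is why the text notes that precision values can be read directly from the absolute figures.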
The main question this type of analysis tries to answer is why certain relevant documents were not retrieved, or why certain non-relevant documents were retrieved. On this basis, recommendations for real improvements in systems can be made. Because of this trend, it was decided that Experiment I should also be concerned, although on a very small scale, with the question of why the processing of some queries resulted in better performance by manual right-hand truncation than by automatic word conflation, and vice versa.

6.6 Conclusions

In Experiment I, characterized as a laboratory-type test, the following items were employed:

• a test collection;
• three different types of text representation (automatic stemming, manual right-hand truncation, non-conflation);
• a best-match searching strategy implemented in the INSTRUCT retrieval package.

It has to be noted that this test collection represents the first such collection to be set up in a Slovene IR environment. The implementation of these items can serve as a starting point for a decision on how to collect data and how to present the results of Experiment I. These issues are detailed in the next chapter.

Chapter 7

Evaluation of the Stemming Algorithm for Slovene IR: Analysis of Results

7.1 Introduction

The main objective of Experiment I was to test the following two hypotheses in the context of a best-match environment:

• HYPOTHESIS 1: There is a significant difference in retrieval effectiveness between queries which have been subjected to automatic word conflation and those which have not been stemmed.

• HYPOTHESIS 2: There is no significant difference in retrieval effectiveness between queries which have been subjected to automatic word conflation and those which have been submitted to manual right-hand truncation, carried out by a trained intermediary.

So far, there has been no published evidence of similar experiments in Slovene IR systems.
It was also hoped that the results of Experiment I would provide a framework to test a multi-lingual approach to IR, using statistically-based techniques. This chapter will first describe how the data was collected. On this basis, the analysis of results will be presented.

7.2 Collection of data

To obtain the data required for the evaluation, the following procedures were employed:

• searching across a document collection, using three different types of query processing;
• retrieving a pool of document records that were then judged relevant or non-relevant by requesters;
• analyzing the obtained data, using basic measures of retrieval performance (i.e., recall and precision).

These procedures are described below.

7.2.1 Searching

On the basis of two sets of queries (i.e., a set of queries written in natural language, and a set of queries processed by a trained intermediary), the following three different types of searching were carried out for each query:

• best-match searching using automatic word conflation;
• best-match searching using manual right-hand truncation;
• best-match searching using unstemmed words.

The execution of these search strategies yielded three sets of retrieved documents for each query. Since the test collection consisted of 48 different queries, a total of 144 listings of retrieval output were produced.

7.2.2 A pool of retrieved documents

Since a ranked output is provided by best-match searching, it is important for evaluation purposes to establish rank cutoff points. Bearing in mind the size of the document collection and the set of queries, a cutoff was defined at position 10. This means that, after INSTRUCT ranks documents in order of decreasing similarity with the query, only the top 10 documents are examined further. Using this ranked-output cutoff procedure, a pool of retrieved documents was developed.
Or, more precisely, after each of the 48 queries was processed in three different ways, the first 10 documents of each search formed the pooled output. Thus, each query had an associated pool of retrieved documents. The list of pooled documents for each query was returned to the requester for relevance assessments. The relevant documents were then compared with the top 10 documents from each type of search. This comparison was the starting-point for measuring the performance of the three different types of search.

7.3 Analysis of results

7.3.1 Recall and precision as measures of retrieval effectiveness

On the basis of a comparison of the relevant items from the retrieved pool with the top 10 documents in the three different lists, the absolute figures were obtained for each query, and then recall values were calculated (see Table 7.1). The precision values, as a percentage, can be extracted from the absolute numbers if multiplied by 10. All these values were then also averaged over the request set, for each type of search separately, in order to obtain the mean number of relevant documents, i.e., the aggregate precision and the aggregate recall. Table 7.1 shows that automatic stemming and manual right-hand truncation employed in best-match searching achieved almost the same results.
Query   Stemmed     Truncation   Unstemmed   Pooled     Total distinct
        rel    %    rel    %     rel    %    rel docs   retrieved docs
1        6   100     5    83      4    67       6           17
2        9    90     9    90      4    40      10           17
3        9   100     7    78      5    56       9           18
4        7   100     7   100      4    57       7           15
5        8    89     6    67      6    67       9           17
6        8    89     8    89      2    22       9           20
7       10    77    10    77      6    46      13           17
8        3   100     3   100      1    33       3           17
9        7    58    10    83      3    25      12           19
10       7    78     6    67      2    22       9           18
11       9    64     9    64      8    57      14           18
12      10    67    10    67      7    47      15           18
13       8    89     8    89      8    89       9           15
14       9    82     9    82      7    64      11           15
15       9    82     9    82      9    82      11           12
16       5    83     5    83      3    50       6           17
17       7   100     6    86      6    86       7           16
18       7   100     6    86      6    86       7           14
19       6    43     6    43      8    57      14           20
20      10    50     9    45     10    50      20           21
21      10    83    10    83      3    25      12           19
22       6    86     6    86      3    43       7           17
23       6    86     6    86      4    57       7           16
24       3    75     4   100      1    25       4           21
25       6    50     7    58      7    58      12           19
26       7    87     7    87      6    75       8           15
27       8    80     8    80      8    80      10           13
28       7    54     8    62      8    62      13           18
29       5    62     5    62      4    50       8           17
30       1   100     1   100      0     0       1           16
31       5    62     5    62      4    50       8           19
32       3    60     3    60      2    40       5           17
33       5   100     5   100      3    60       5           11
34       4    67     4    67      3    50       6           18
35      10    83     8    67      5    42      12           19
36       9   100     9   100      5    65       9           15
37       3    75     2    50      1    25       4           22
38       4    67     3    50      4    67       6           20
39       7    78     8    89      3    34       9           19
40       5    83     4    67      4    67       6           18
41       3    37     3    37      5    62       8           20
42       2   100     2   100      0     0       2           17
43       6   100     4    67      2    34       6           21
44       5    62     6    75      5    62       8           17
45       6   100     6   100      2    34       6           16
46       2    50     3    75      3    75       4           20
47       5    71     5    71      3    43       7           18
48       5    71     7   100      3    43       7           21
Total  302    75   297    74    210    52     401          840

Table 7.1: Number of relevant retrieved documents and aggregate recall for 48 queries, using three different types of search (cutoff 10)

While automatic stemming resulted in a total of 302 relevant documents over the set of queries (or, in other words, 6.3 relevant documents per query), manual right-hand truncation performed just slightly worse, in that it retrieved a total of 297 relevant documents (or, in other words, 6.2 documents per query). On the other hand, non-conflation of terms in queries and documents resulted in a very poor performance. A set of only 210 relevant documents was retrieved by this type of search, or, in other words, only 4.4 documents per query.
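The per-query means quoted above follow directly from the totals in Table 7.1:

```python
# Totals of relevant documents retrieved over the 48 queries (Table 7.1).
totals = {"stemmed": 302, "truncation": 297, "unstemmed": 210}
queries = 48

for system, total in totals.items():
    print(system, round(total / queries, 1))
```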
The above data can also be interpreted in terms of retrieval effectiveness, using the measure of recall. For each query, and for all queries in aggregate, the recall percentage was calculated as:

(retrieved and relevant in the database / pooled retrieved and relevant) x 100

Table 7.1 reveals that there is very little performance difference between automatic stemming and manual right-hand truncation. While the former achieved an average recall of 75%, the latter was slightly behind with an average recall value of 74%. The unstemmed processing of queries and documents performed the worst, achieving an average recall of only 52%. The precision figures can also be extracted from Table 7.1, i.e., using the mean number of relevant documents. Thus, precision for automatic stemming was 63%, for manual right-hand truncation 62%, and for non-conflation 44%. The average recall and precision values for the three different types of text processing are summarized in Table 7.2. The results in the tables above indicate the following:

• both types of word conflation (i.e., automatic stemming and right-hand truncation) perform much better than non-conflation; i.e., they are able to retrieve a higher percentage of relevant documents from a database;
• there is very little difference between those two types of text representation; both automatic stemming and manual right-hand truncation are able to retrieve a similar percentage of relevant documents from a database.

Retrieval effectiveness   Stemmed   Truncation   Unstemmed
RECALL                      75          74           52
PRECISION                   63          62           44

Table 7.2: The average recall and precision values of three different types of search (cutoff 10)

In order to investigate whether the cutoff factor influenced the performance results, an additional ranked-cutoff procedure at position 5 (i.e., the top 5 documents) was employed. The performance differences between the three types of search at cutoff 5 are summarized in Table 7.3.

Results               Stemmed   Truncation   Unstemmed
No. of rel. docs        182        183          138
Mean of relevants       3.8        3.8          2.9
Recall                   45         45           34
Precision                76         76           58

Table 7.3: Results from three different types of search at cutoff 5

Using the lower cutoff position (i.e., the top 5 documents), with the same pool of retrieved documents, means that the average recall numbers in Table 7.3 are lower than in Table 7.2. However, the most important fact is that the performance differences between the three types of text representation remain more or less unchanged. This means that automatic stemming and right-hand truncation exhibit a superior performance over unstemmed processing of words, and that the change in cutoff does not affect the relative performance of the three methods.

In order to establish, firstly, that both automatic stemming and manual right-hand truncation perform equally well, and, secondly, that they are superior to unstemmed word processing, a statistical significance test on the differences has to be carried out. In this experiment, the following two tests were employed:

• the sign test;
• the Kendall coefficient of concordance, W.

Since no difference was found between the results obtained at cutoff 10 and cutoff 5, data from the former cutoff point was used for both significance tests.

7.3.2 Significance tests

The sign test

The sign test gets its name from the fact that it is based upon the direction of differences between two measures. The sign test is applicable to the comparison of two related samples. It is particularly useful for research in which it is possible to determine, for each pair of observations, which is the "greater" (Siegel and Castellan, 1988). In applying the sign test, the focus is on the direction of the difference between a pair, noting whether the sign of the difference is positive (+) or negative (-). In addition, a "tie" (0) occurs when two values are equal.
In this experiment, three pairs of observations were investigated for a significant difference:

• the performance of automatic stemming vs. manual right-hand truncation;

• the performance of automatic stemming vs. non-conflation;

• the performance of manual right-hand truncation vs. non-conflation.

For each pair, the sign test required the introduction of the null hypothesis (H0) and the alternative hypothesis (H1). They can be generalized as follows:

• H0: the difference between two types of text representation is zero (e.g., the application of automatic word conflation and manual right-hand truncation results in the retrieval of a similar percentage of relevant documents);

• H1: the difference between two types of text representation is positive (e.g., the employment of automatic stemming results in a larger number of relevant documents than manual right-hand truncation).

Two additional details are required for the sign test, i.e., the significance level (α) and the number of couples (N) under observation. While the significance level (α) was defined at 0.05, the number of couples (N) was equal to the number of queries, i.e., N = 48. If a matched pair showed no difference (i.e., the difference was zero and had no sign), it was dropped from the analysis and N was reduced accordingly. Since H1 predicts the direction of the differences, the rejection region was defined to be one-tailed. It consists of all values of x (where x is the number of +s, since the prediction for H1 is that positive differences will predominate) for which the one-tailed probability (p) of occurrence when H0 is true is equal to or less than α = 0.05. If N is still larger than 35, a method for large samples has to be applied (see Siegel and Castellan, 1988), using the following formula:

    z = (2x - 1 - N) / √N

Results of the significance test on these three pairs of observations, using the sign test, are presented in Table 7.4 and discussed below.

Comparison                        +    -    0     z      p
Stemming vs. truncation          12    8   28     -    0.86800
Stemming vs. non-conflation      37    5    6   4.78   0.00003
Truncation vs. non-conflation    34    4   10   4.70   0.00003

Table 7.4: Frequency distribution of the direction of differences between three pairs of text representation (with z and p values)

Automatic stemming vs. manual right-hand truncation. Table 7.4 shows that automatic stemming was more successful in 12 observations and less successful in 8 observations than manual right-hand truncation. The number of tied cases was 28; consequently, N was reduced to 20. Appendix Table D in Siegel and Castellan (1988) shows that for N = 20 the probability of observing x >= 12 when H0 is true is 0.868 (one-tailed). Since this value is not in the region of rejection for α = 0.05, the decision is to reject H1 in favour of H0. Thus, this test has shown that there is no significant difference between automatic word conflation and manual right-hand truncation.

Automatic stemming vs. non-conflation. It can be seen from Table 7.4 that automatic stemming performs better than non-conflation in 37 cases. Its performance was less successful in only 5 cases, and equal values were obtained in 6 cases. It follows that N has to be reduced to 42. Since N is still larger than 35, the method for large samples was applied, resulting in a value of z = 4.78. Reference to Table A in Siegel and Castellan (1988) reveals that the probability of z >= 4.78 when H0 is true is 0.00003. Since 0.00003 is smaller than α = 0.05, the decision was to reject the null hypothesis in favour of the alternative hypothesis. Thus, the sign test proved that there is a significant performance difference between automatic stemming and non-conflation. Or, in other words, automatic stemming is able to produce significantly better results than non-conflated processing of words.

Manual right-hand truncation vs. non-conflation. Table 7.4 shows that manual right-hand truncation performs better than non-conflation in 34 cases.
Its performance was less effective in only 4 cases, and "ties" were obtained in 10 cases. Consequently, N was reduced to 38 and the method for large samples was used again. Applying the above formula, a value of z = 4.70 was produced. Reference to Table A in Siegel and Castellan (1988) revealed that the probability of z >= 4.70 when H0 is true is 0.00003. Thus, the decision was again to reject the null hypothesis in favour of the alternative hypothesis. Or, in other words, the manual removal of endings from words in queries leads to significantly better performance results than non-conflation.

Conclusions. To summarize the results of the above three tests:

• there is no significant difference in retrieval performance between automatic stemming and manual right-hand truncation; both types of text representation are equally successful in retrieving relevant documents from the Slovene document collection;

• there is a significant difference in retrieval performance between word conflation and non-conflation; automatic word conflation and manual right-hand truncation are much more successful in retrieving relevant documents from the Slovene document collection than non-conflation.

However, since only pairs of variables were compared in the sign test, there was a need to introduce another significance test which is able to differentiate among three or more variables. This test is known as the Kendall coefficient of concordance, W.

The Kendall coefficient of concordance, W

The use of this test requires k sets of rankings of N objects or individuals. On this basis, the association among them can be determined, using the Kendall coefficient of concordance, W. W expresses the degree of association among k such variables, that is, the association between the k sets of rankings. Or, in other words, W measures the extent to which k rankings of the same set of N objects are in agreement with each other (Siegel and Castellan, 1988).
The test, therefore, consists of the following stages:

1. to find the value of W and determine its significance (when W is significant, a high agreement exists);

2. on the basis of a high agreement, W can be interpreted (i.e., a decision about the "best estimate" can be made).

Thus, in this experiment, the first step was to determine agreement among users (i.e., 48 queries) on the association among three different types of search. If that agreement exists, then the best type of search can be defined. Using the set of 48 queries, the following hypotheses were introduced:

• H0: There is no agreement among the users on the association among three different types of search.

• H1: There is an agreement on the association among three different types of search.

The significance level (α) was again defined at 0.05 (i.e., α = 0.05). To compute W, the data is first arranged into a k x N table, with each row representing the ranks assigned by a particular judge to the N objects. These ranks (i.e., the search retrieving the largest number of relevant documents has rank 1, etc.) were assigned for all 48 queries. On this basis, the sum of ranks, Ri, and the average rank were calculated for each column, as shown in Table 7.5.

Type of search    Sum of ranks    Average rank
Stemmed                79              1.6
Truncation             79              1.6
Unstemmed             128              2.7

Table 7.5: The sum of ranks and the average rank for each type of search

To obtain the mean value of Ri, the sum of ranks (286) is divided by N (3); the mean value is equal to 95.3. Each of the Ri may then be expressed as a deviation from the mean value. The larger these deviations, the greater the degree of association among the k sets of ranks. Finally, s, the sum of squares of these deviations, is found (s = 1,600). Knowing these values, the value of W was calculated using the following formula (see Siegel and Castellan, 1988):

    W = 12s / (k²(N³ - N))

The value obtained for W was 0.347.
This value for W was then compared with the critical values to determine whether there is a statistically significant degree of agreement between the k rankings. To test the significance of the value of W (W = 0.347), the probability associated with this value was determined. This was done by first calculating the chi-square value from:

    χ² = k(N - 1)W

Thus, the calculated χ² value is 33.3. Referring to the table of critical values of chi-square (Siegel and Castellan, 1988), we find that the probability associated with this χ² is less than 0.001 (p < 0.001). Since p < 0.05, H0 can be rejected in favour of H1. Or, in other words, the value of W shows that there is a high degree of agreement among users on the association among the different types of search. This agreement is much higher than it would be by chance. It follows that users were applying essentially the same standard in ranking the N objects (i.e., different types of search) under study. To find out which type of search performed the best, the sum or average of ranks can be used. The "best estimate" is associated, in a certain sense, with a least-squares estimate. In this experiment, the best retrieval performance can, therefore, be assigned to automatic stemming and manual right-hand truncation, for in each of their cases the sums of ranks are equal, R1 = R2 = 79 (or, the average rank = 1.6), i.e., the lowest value observed.

To summarize, the significance test using the Kendall coefficient of concordance, W, has again demonstrated the following:

• both types of word conflation (i.e., automatic stemming and manual right-hand truncation) perform significantly better in a Slovene IR environment than non-conflation;

• there is no significant difference between automatic stemming and manual right-hand truncation; both types of text representation are able to retrieve a similar percentage of relevant documents from the Slovene document collection.
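The test statistics used above can be reproduced directly from the tabulated counts; a minimal sketch (the z formula is Siegel and Castellan's large-sample sign-test approximation, and W is computed from the column rank sums of Table 7.5):

```python
import math

def sign_test_z(x, n):
    # Large-sample sign test (Siegel & Castellan): x = number of '+'
    # differences, n = non-tied pairs; the -1 continuity correction
    # applies because positive differences are predicted to predominate.
    return (2 * x - 1 - n) / math.sqrt(n)

def kendall_w(rank_sums, k):
    # Kendall's coefficient of concordance from the column rank sums;
    # k judges (here 48 queries), N = len(rank_sums) objects (searches).
    n = len(rank_sums)
    mean = sum(rank_sums) / n
    s = sum((r - mean) ** 2 for r in rank_sums)
    return 12 * s / (k ** 2 * (n ** 3 - n))

print(round(sign_test_z(37, 42), 2))   # stemming vs. non-conflation
print(round(sign_test_z(34, 38), 2))   # truncation vs. non-conflation
w = kendall_w([79, 79, 128], k=48)
print(round(w, 3), round(48 * (3 - 1) * w, 1))  # W and its chi-square value
```

Running this reproduces the reported values z = 4.78 and z = 4.70, W = 0.347 and χ² = 33.3.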
In this context, it is interesting to emphasize the very large performance difference between unstemmed and stemmed (automatic conflation/manual right-hand truncation) processing of words. This difference is much larger than the performance differences obtained from experiments carried out on English document collections (e.g., Lennon et al., 1981; Harman, 1991). For example, the results of experimental tests by Harman (1991) revealed no substantial difference between full word retrieval and retrieval using suffixing. Although individual queries were affected by stemming, the number of queries with improved performance tended to equal the number with poorer performance, thereby resulting in little overall change for the entire test collection.

In order to prove that these performance differences between Slovene and English IR systems reflect the richer Slovene morphology, a similar test to that carried out by Harman (1991) was repeated in the context of Experiment I. Two types of best-match search, i.e., retrieval based on suffixing and full word retrieval, were employed on the English database. This database was developed as part of the test collection for Experiment II, and is described in detail in Chapter 8. It is important to note that this collection consisted of an English translation of the Slovene text, and therefore contained identical documents and queries to those used in Experiment I. The employment of unstemmed and stemmed searching on the English database (referred to as ENGL) produced the results presented in Table 7.6; these results are compared with the results of the equivalent search on the Slovene database (referred to as SLOV).
Type of search    SLOV    ENGL
Stemmed            302     248
Unstemmed          210     234

Table 7.6: Number of relevant documents retrieved by stemmed and unstemmed searches on the English and Slovene databases

The results in Table 7.6 clearly show that the employment of the English stemming algorithm produces only a slight improvement over non-conflation (248 relevant documents vs. 234 relevant documents). Moreover, the application of the sign test reveals no significant performance difference between stemmed and unstemmed text representation in an English information retrieval system, as is evident in Table 7.7.

Comparison                      +     -     0      p
Stemming vs. non-conflation    17    13    18   0.81900

Table 7.7: Comparison between automatic stemming and non-conflation in the English database (results of the sign test)

Table 7.7 shows that automatic stemming performs better than non-conflation in 17 cases. Its performance was less effective in 13 cases; the number of tied cases was 18. All tied cases were dropped from further analysis; consequently, N was reduced to 30. Appendix Table D in Siegel and Castellan (1988) shows that for N = 30 the probability of observing x >= 17 when H0 is true is 0.819 (one-tailed). Since this value is not in the region of rejection for α = 0.05, the decision was to reject H1 in favour of H0. Thus, this test has shown that there is no significant difference between these two types of text representation. These results are entirely analogous to those presented by Harman (1991). There is no doubt that a language with a rich morphology (e.g., Slovene) correlates with a larger distinction between retrieval performance based on suffixing and retrieval performance based on using full words. This again indicates the importance of an effective Slovene stemming algorithm.
7.3.3 Additional comparison of automatic stemming and manual right-hand truncation

In order to illustrate that these two types of search exhibit no significant performance difference in terms of retrieval effectiveness, some additional quantitative data can be presented. For example, Table 7.8 shows the frequency distribution of different documents retrieved by automatic stemming and by manual right-hand truncation.

Number of different docs.    0    1    2    3    4    5    6    Total
Frequency                   17    7   11    8    3    1    1     48

Table 7.8: Frequency distribution of different documents retrieved by automatic stemming and manual right-hand truncation

The most striking feature of Table 7.8 is the fact that the processing of 17 queries (35.4%), using the two different types of search (i.e., automatic stemming and manual right-hand truncation), resulted in the same top 10 documents. In addition, the ranking was entirely identical in 14 cases (29.2%). In order to find out why there are some differences (albeit very small) between automatic stemming and manual right-hand truncation in the Slovene IR system, a simple qualitative investigation was carried out. Six queries were used as a sample for this type of evaluation, as illustrated in Table 7.9.

Query no.     18   25   35   39   40   48
Stemmed        7    6   10    7    5    5
Truncation     6    7    8    8    4    7

Table 7.9: Number of relevant documents retrieved by two types of search for six queries

While the processing of queries 18, 35 and 40 resulted in better performance by automatic stemming, the processing of queries 25, 39 and 48 resulted in a larger number of relevant documents retrieved by manual right-hand truncation. A detailed analysis of these six queries is presented below. First, the queries where the stemming algorithm performed better are analyzed, followed by the queries where the manual processing of terms performed better. The English translation of each query is given in square brackets.
QUERY 18: Teorija in modeli samoupravnega javnega komuniciranja
[Theory and models of self-management mass communication]

A manual removal of endings, carried out by a trained intermediary, returned the following stems:

    teor? model? samoupravn? javn? komunic?

It has been emphasized in Chapter 3 that Slovene morphology is characterized not only by a large number of endings, but also by alterations which are carried out during word formation. If we consider the stem KOMUNIC* from the above list, its related words can also be KOMUNIKACIJA, KOMUNIKACIJAM, etc. Since the trained intermediary did not take these variations (-IC → -IK) into account, the processing of the above query resulted in a smaller number of relevant documents (6) than the number obtained by automatic conflation (7). The stemming algorithm produced the stem KOMUNI*, which was able to retrieve some other related terms from the relevant documents.

QUERY 35: Citatna analiza (analiza citatov)
[Citation analysis]

The manual removal of suffixes, carried out by a trained intermediary, resulted in the following list of stems:

    citat? analiz?

The frequency of the alterations which are conducted during word formation in Slovene is again demonstrated in this example. The related forms of the term CITAT include the following: CITIRANJE, CITIRANOST, etc. Consequently, the manual removal of suffixes was not capable of taking these alterations into account. On the other hand, the stemming algorithm can successfully process words such as CITATI, CITIRANJE, CITIRANOST, etc. and conflate them into CITAT, after using the recoding rule -IR → -AT. As a consequence, the stemming algorithm retrieved a larger number of relevant documents (10) than did the manual removal of suffixes (8).
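The effect of the two conflation widths in Query 18 can be pictured as simple prefix matching against the dictionary component of the inverted file; the word list below is a hypothetical fragment for illustration, not taken from the actual dictionary:

```python
def matches(stem, vocabulary):
    # Right-hand truncation: a stem matches every vocabulary word that
    # begins with it (the trailing wildcard character is implicit).
    prefix = stem.rstrip('*?')
    return [w for w in vocabulary if w.startswith(prefix)]

# Hypothetical vocabulary fragment illustrating Query 18:
vocab = ['KOMUNICIRANJE', 'KOMUNICIRANJA', 'KOMUNIKACIJA', 'KOMUNIKACIJAM']
print(matches('KOMUNIC?', vocab))  # the manual stem misses the -IK variants
print(matches('KOMUNI*', vocab))   # the automatic stem catches all four forms
```

The shorter automatic stem conflates across the -IC/-IK alternation at the cost of admitting any other word sharing the prefix, which is precisely the over/understemming trade-off discussed in the following queries.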
QUERY 40: Pomen mikrofilma in mikrofilmanja v knjižnicah in INDOK centrih
[Microfilm and microfilming in libraries and INDOC centres (centers; information and documentation services)]

The trained intermediary produced the following list of stems:

    pomen? mikrofilm? knjižni? INDOK? centr?

Whilst the trained intermediary conflated the word CENTRIH to CENTR*, the stemming algorithm produced the stem CENT*. This difference in conflation was the basic reason for the slightly better performance achieved by automatic conflation (5 relevant documents) than by manual right-hand truncation (4 relevant documents). Again, the stemming algorithm has not ignored the Slovene linguistic characteristics, i.e., the variants of the stem CENT* can be either words such as CENTER or terms such as CENTRI, CENTRIH, etc.

The above examples show that some of the very important linguistic rules of the Slovene language are included in the stemming algorithm. Since the professional intermediary was either not aware of these rules, or they were not applicable to manual right-hand truncation, the stemming algorithm was superior to the manual right-hand truncation in 12 queries. However, in eight cases the stemming algorithm was less successful in retrieving relevant documents within the top 10 ranked documents. Some of the reasons why its performance was less effective are described below, using a sample of three queries.

QUERY 25: Izobraževanje (šolanje, vzgoja) knjižničarskih kadrov (bibliotekarjev, bibliotečnih kadrov), učni načrti in strokovni izpiti
[Training and education of librarians (library staff, personnel) - educational programmes (curriculums) and examination regulations]

From this query, the processing of two words had a crucial influence on the retrieval results.
While the trained intermediary truncated the terms ŠOLANJE and STROKOVNI to ŠOLAN* and STROKOV*, the stemming algorithm generated the stems ŠOL* and STROK* in order to retrieve more related words, i.e., to increase recall. However, this "strong" removal of suffixes resulted in only 6 relevant documents retrieved by the stemming algorithm; the manual truncation returned 7 relevant documents. The following are some examples of undesirable phrases which were brought in by the overstemming: ŠOLSKA KNJIŽNICA (SCHOOL LIBRARY), OSNOVNA ŠOLA (PRIMARY SCHOOL), ŠOLSKO LETO (ACADEMIC YEAR), and STROKOVNE KNJIŽNICE (SPECIAL LIBRARIES). These phrases are by no means related either to the phrase ŠOLANJE BIBLIOTEKARJEV (EDUCATION OF LIBRARIANS) or to the phrase STROKOVNI IZPITI (EXAMINATION REGULATIONS).

QUERY 39: Analiza uporabnikov v knjižnicah
[User studies (surveys) in libraries]

The processing of this query, in particular the word UPORABNIKOV, shares similar characteristics with the previous query. The stemming algorithm has again, in order to increase recall, produced a string with fewer characters than the one created by the trained intermediary, i.e., UPORAB* vs. UPORABN*. The consequence was a smaller number of relevant documents (7) than the number obtained by manual processing (8). Overstemming can again be illustrated by the retrieval of the following two phrases, which are unrelated to the phrase ANALIZA UPORABNIKOV (USER STUDIES): UPORABA PROGRAMSKE OPREME (SOFTWARE USE), and UPORABA CITATOV (CITATION USE).

QUERY 48: Centralni katalog - avtomatizacija
[Union catalogue - automation]

Manual right-hand truncation, carried out by the trained intermediary, resulted in the following list of stems:

    centraln? katalog? avtomat?

The employment of the stemming algorithm resulted in the following list of stems:

    cent* katal* avtomat*

In this query, the stem CENT* was again produced by the stemming algorithm.
Whilst its employment in query number 40 (as described above) resulted in better retrieval performance, precision was reduced in this query. Moreover, the reduction of CENTRALNI to CENT* can be considered an example of overstemming. The stem CENT* also retrieved phrases such as RAZISKOVALNI CENTER (RESEARCH CENTRE) and INDOK CENTER (INDOC CENTRE), i.e., phrases which are by no means related to the phrase CENTRALNI KATALOG (UNION CATALOGUE). As a consequence, manual right-hand truncation retrieved a larger number of relevant documents (7) than automatic stemming (5).

A comparison of queries 40 and 48 illustrates how difficult it sometimes is to obtain a balance between over- and understemming in designing a stemming algorithm. However, as shown in the previous sections, the Slovene stemming algorithm produced good retrieval results in these experimental tests, i.e., the performance differences between automatic stemming and manual right-hand truncation were not significant at all.

7.4 Conclusions

The main objective of Experiment I was to test whether automatic word conflation can be introduced into Slovene information retrieval systems with no average loss of performance, thus allowing easier user access to the system. The experiment was also carried out to provide a basis for introducing other statistically-based techniques into a Slovene IR environment; so far, these techniques have been tested mainly on English language document collections. The results of the experimental testing have confirmed the two main hypotheses of Experiment I:

• there is a significant performance difference between automatic word conflation and unstemmed processing of words in queries and documents;

• there is no significant performance difference between automatic stemming and manual right-hand truncation carried out by a trained intermediary.
It follows that one of the important components of an IR system, i.e., word conflation, can be automated in Slovene IR systems with no average loss of performance. While there are some signs that stemming does not make enough difference to the retrieval of English documents (Harman, 1991; Keen, 1991b), the results above confirmed that devising an effective automatic means of stemming in a Slovene IR environment can significantly increase retrieval effectiveness.

Having obtained good performance results with the employment of automatic word conflation procedures, the next experiment, Experiment II (as described in Chapter 8), was carried out. Its main objective was to test the performance of statistically-based IR techniques in two different languages, i.e., Slovene and English. It was hoped that the results of this experiment could serve as an important contribution towards solving the problem of a multi-lingual approach to information retrieval.

Chapter 8

Multi-Lingual Approach to Document Retrieval

8.1 Introduction

If statistically-based techniques appear to work well for English (Willett, 1988a), there is no a priori reason why they should not work equally well for another language, in this case, Slovene. However, a comprehensive analysis comparing English and Slovene retrieval performance is required to confirm the correctness of the above assumption. Therefore, the main problem to be investigated in Chapter 8 is contained in the following two questions:

1. Are statistically-based techniques applicable to Slovene IR systems?

2. Could statistically-based techniques provide a framework for developing a multi-lingual IR system?

Experiment II was designed and conducted in order to provide answers to the above two questions. The methodology and results of Experiment II are described in this chapter. First, the purpose and background of the experimental test are outlined.
The following section on methodology consists of two sub-sections: an outline of the test environment, and a description of the test procedures. The analysis and presentation of results serve as a basis for conclusions and suggestions for further work.

8.2 Purpose of the experiment

Starting from the above notions, the following three main hypotheses were introduced in Experiment II:

• HYPOTHESIS 1: Processing of the English documents and queries within a best-match environment will produce more or less identical hits to those retrieved from the Slovene document collection.

• HYPOTHESIS 2: There is no significant performance difference in retrieval effectiveness between Slovene queries and documents and their equivalents translated into English.

• HYPOTHESIS 3: Processing of English and Slovene stems, using a string similarity measure, will produce a similar number of semantically related terms.

In order to carry out Experiment II, the methodology from Experiment I was employed. In other words, the need to control all variables of the experiment as far as possible (Robertson, 1981) dictated the application of a laboratory test. The decision to use this type of evaluation, rather than an operational test, meant that Experiment II was neither concerned with the performance of the complete IR system, nor were any of the user-oriented variables (e.g., human factors) taken into account. The experimental testing required the following implementation of variables from Experiment I:

• enhancement of the test collection with documents and queries translated into English, and with additional relevance judgments;

• design and development of the English version of INSTRUCT for a PC;

• searching of documents in the English database;

• identification of semantically related terms from the English and Slovene dictionary components of the inverted file;

• evaluation of results.
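Hypothesis 3 relies on a string similarity measure applied to stems. The measure actually used in INSTRUCT is not specified in this excerpt; a common choice for identifying term variants in this line of work is the Dice coefficient over character bigrams, so the sketch below should be read under that assumption:

```python
def bigrams(word):
    # The set of overlapping two-character substrings of a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    # Dice coefficient over bigram sets: 2|X ∩ Y| / (|X| + |Y|).
    # A hypothetical stand-in for the string similarity measure used
    # to identify stem variants in the dictionary of the inverted file.
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

# Two morphological variants from the Query 18 discussion:
print(round(dice('KOMUNICIRANJE', 'KOMUNIKACIJA'), 2))
```

Terms whose similarity to a query stem exceeds some threshold would then be treated as related; the threshold value and the exact measure are implementation details not given here.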
Before the implementation of the above variables is described in detail, a brief background to Experiment II is presented in the next section.

8.3 Background for the experiment

It has already been pointed out for Experiment I that, so far, there has been no published evidence of similar experiments carried out in Slovene IR systems. This comment can be extended to Experiment II. In other words, a comparison of Slovene and English IR performance, based on the application of statistical IR techniques, can be described as a pioneering experiment in this area. Moreover, the employment of statistically-based IR techniques in multi-lingual information systems is not only the first such experiment carried out in Slovenia, but also one of very few similar projects reported, so far, worldwide. One of the main reasons for the low interest in developing multi-lingual IR systems in general (not only by applying automatic methods) is the widespread use of the English language. Evidence shows that English is the international language for the communication of scientific and technical information. It is therefore not altogether surprising that large United States or British information services have not felt an urgent need to provide multi-lingual output. Pressure for multi-lingual information systems has tended to come from countries or regions using major languages other than English. Countries within the EC (in particular Belgium) and other multi-lingual countries (Canada, the USA, Yugoslavia, etc.) are doubtless among those in which end-users could benefit from a multi-lingual approach to document retrieval. So far, the following two advantages have helped end-users in these countries to overcome the language barrier in document retrieval:
In these countries, most information centres of sufficient sophistication to require access to large internationally available databases are likely to have at least one information intermediary with sufficient knowledge of English to carry out searches. Nevertheless, in spite of the dominant position of English there are undoubtedly areas where a multi- lingual approach to information systems is highly desirable. Apart from stressing a need for promoting information flows through wide multi-national involvement in global systems and networks, there is one particular factor that is gaining in importance. This factor—known as the end-usei—seems to be increasing efforts by information researchers to develop techniques to overcome natural language difficulties. In other words, the provision of end-user searching facilities in document retrieval systems has been recognized as the only way to remove the barrier between the original source of a query and the query's answer. For an IR system to be accepted by end-users, the following requirements must be met (VVolpert, 1983): speed, ease of use, high recall, relevance, and flexibility. In the field of language needs, the possibility for end-users to input a query and to receive answers in a language with which they have a high degree of familiarity, is doubtlessly one of the top requirements. In the literature, the following four specific techniques for overcoming language problems in information transfer were considered: • automatic translation of unprocessed text; • automatic translation of preprocessed text; • use of multi-lingual thesauri; • switching languages. 197 Although automatic translations of both unprocessed and preprocessed text were not developed primarily for IR purposes, their potential use for abstracting services made them of particular importance in the late Seventies (Dubois, 1979). 
Automatic translation of unprocessed text is usually based on the following elements: a terminological dictionary (often a morpheme dictionary) to identify all words that are likely to be encountered; a transformation grammar, containing the rules of the source and the target languages together with translation rules; and a set of processing algorithms. The main problems with such procedures have proved to be ambiguities caused by homographs and prepositional reference. According to Dubois (1979), the application of these techniques might be useful only in highly specialized fields.

An attempt to eliminate the unreliability of the above technique was made by automatic translation of preprocessed text, using a limited number of syntactical forms or specified grammatical rules and a pre-established vocabulary. Again, this method can only be applied in fairly narrow areas, e.g., the study of certain industrial or chemical processes. This approach is implemented in TITUS, a system developed by the Institut Textile de France (Dubois, 1979).

The third technique is that of a multi-lingual thesaurus. Because of the absence of syntactical relationships, indexing in this case is straightforward. However, the largest problem seems to be the initial development of the thesaurus and its maintenance. Recent experience supports the view that building a multi-lingual thesaurus in several languages at the same time is a better method than translating an existing thesaurus. This is because the latter procedure generates homographs on a scale whose resolution leads to a weakening of thesaural structures in the target languages. This was confirmed experimentally by Sager et al. (1982) in building an integrated multi-lingual thesaurus for the social sciences. The authors emphasized the importance of planning thesauri multi-lingually from the start.
Every addition of another language requires the total reassessment of all descriptors, and it is therefore advisable to construct a thesaurus from the outset with a view to the various languages likely to be required in the future. It is interesting to note that such an approach was used in building the Serbo-Croat and English multi-lingual thesaurus for a database containing documents on law and legislation (Martinovič, 1985).

The use of switching languages permits the conversion of indexing terms used by a given centre into the terms used by a number of alternative centres (Dubois, 1979). With two or three indexing languages, tables of equivalents or appropriate conversion algorithms could link each term in one language with its conceptual equivalent(s) in another.

Although very little evidence can be found about the retrieval performance of the techniques described above, it seems that the majority of the IR systems interested in providing output in more than one language are heading towards the use of a multi-lingual thesaurus. However, according to Rolland-Thomas and Mercure (1989), there is still only a very modest account of the ongoing research and writing on subject access in multi-lingual IR systems. The same comments can be made for multi-lingual OPACs, which have yet to appear.

8.3.1 Statistically-based techniques in multi-lingual IR systems

It has often been emphasized in previous chapters that statistically-based IR techniques allow greater end-user involvement in the searching process. It follows that these techniques could potentially be very attractive tools in designing multi-lingual IR systems. The very little evidence about the use of statistically-based techniques in multi-lingual information systems—as revealed by a citation search in the LISA database—can therefore be considered unexpected.
Although one of the first experiments in this area was carried out by Salton (1969) more than two decades ago, very few similar reports can be traced afterwards in the literature.

Salton's experiment (1969) was a part of the SMART project. His starting point was the good experimental results that had been obtained by the employment of fully automatic text processing methods using relatively simple linguistic tools. These methods had been shown to be as effective for purposes of document indexing, classification, search, and retrieval as the more elaborate manual methods used in practice. Since all tests were carried out entirely with English language queries and documents, Salton studied the extension of the SMART procedures to German language materials. A multi-lingual thesaurus was used for the analysis of documents and search requests, and tools were provided which made it possible to process English language documents against German queries and vice versa. The evaluation of these methods showed that the effectiveness of the mixed language processing was approximately equivalent to that of the standard process operating within a single language.

At the heart of Salton's experiment was a synonym dictionary, or thesaurus, which was used to recognize synonyms by replacing each word stem by one or more concept numbers; these concept numbers then served as content identifiers instead of the original word stems. A multi-lingual thesaurus was produced by manually translating into German an originally available English version. It is reported by Salton (1969) that the use of a thesaurus look-up process improves retrieval effectiveness by about 10% in both recall and precision.

The second known experiment in this area, carried out by Field (1977), investigated automatic multi-lingual indexing. French and English systems were compared during their automatic generation of thesaurus terms. Both systems produced successful and also equivalent results.
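Salton's thesaurus look-up, described above, can be illustrated with a minimal sketch: stems from either language map to shared concept numbers, which then serve as the content identifiers matched between queries and documents. The stems and concept numbers below are invented for illustration (using Slovene rather than German as the second language); they are not taken from the SMART thesaurus.

```python
# Sketch of Salton's multi-lingual thesaurus idea: word stems in either
# language are replaced by shared concept numbers, so a query in one
# language can be matched against documents in the other.
# All thesaurus entries here are hypothetical.

THESAURUS = {
    "retriev": 101, "iskanj": 101,      # retrieval / iskanje
    "document": 102, "dokument": 102,   # document / dokument
    "librar": 103, "knjiznic": 103,     # library / knjiznica
}

def to_concepts(stems):
    """Map a list of stems (any language) to a set of concept numbers."""
    return {THESAURUS[s] for s in stems if s in THESAURUS}

def match(query_stems, doc_stems):
    """Number of concepts shared by a query and a document."""
    return len(to_concepts(query_stems) & to_concepts(doc_stems))

# An English query matches a Slovene document through the concept layer:
print(match(["document", "retriev"], ["iskanj", "dokument"]))  # 2
```

The design point is that language-specific stems disappear after the look-up stage, so the matching machinery itself is language-independent.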
On this basis, Field (1977) predicted a promising future for automatic indexing systems which employ a multi-lingual thesaurus. Despite this optimism, as expressed by Salton (1969) and Field (1977), no similar experimental tests have been reported in the last decade. Moreover, Dubois (1979) claimed that these results were not clear enough to warrant serious consideration of these techniques. His conclusion was reinforced by the paucity of concrete applications of monolingual automatic indexing.

It has already been pointed out that the research work on the use of statistically-based IR techniques is now beginning to be reflected in operational systems, in particular in the English language-based IR environment. In addition, some of the important components of these techniques (e.g., automatic word conflation) have indicated the possibility of implementing this approach in Slovene IR systems. Successful performance results obtained by the employment of these techniques in two different languages have stimulated the consideration of automatic statistical methods for a multi-lingual information system. There is no doubt that end-users in Slovenia, being surrounded by document collections written in different languages (i.e., Yugoslav languages and other major European languages), could benefit in the long term from the results of such an experiment.

8.4 Methodology

8.4.1 The test environment

The test environment in Experiment II consisted of the following components:

• a test collection;

• two versions (English and Slovene) of the information retrieval package INSTRUCT;

• a term expansion module, based on a measure of string similarity (trigrams) between a specified query term and each of the terms in the dictionary file;

• a best-match searching strategy.

All of the new components are briefly outlined in the next sections.
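The best-match component listed above can be illustrated with a minimal sketch: each document is scored by the sum of the weights of the query stems it contains, and the highest-scoring documents are returned down to a cutoff (10 in Experiment II). The stems and weights below are invented; INSTRUCT's actual weighting scheme is described elsewhere in the thesis.

```python
# Minimal sketch of best-match (nearest neighbour) searching: score each
# document by the summed weights of the query stems it contains, then
# return the top-ranked documents. Stems and weights are hypothetical.

def best_match(query_weights, documents, cutoff=10):
    scores = []
    for doc_id, stems in documents.items():
        score = sum(w for stem, w in query_weights.items() if stem in stems)
        scores.append((doc_id, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:cutoff]

query = {"retriev": 2, "document": 1, "slovene": 3}
docs = {
    "d1": {"retriev", "document"},
    "d2": {"slovene", "document"},
    "d3": {"catalog"},
}
print(best_match(query, docs, cutoff=2))  # [('d2', 4), ('d1', 3)]
```

Unlike Boolean searching, a document need not contain every query stem to be retrieved; it is simply ranked lower.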
Test collection

A test collection consisting of the Slovene documents, queries, and relevance judgements was set up as a part of Experiment I. However, for the purposes of the multi-lingual test, an English translation of the Slovene texts was required. Therefore the original test collection was extended with documents and queries translated into English. In addition, matching of English queries with the English documents required relevance judgements to be carried out by the same group of users as in Experiment I. A brief description of the English-based components of the test collection is presented below.

Documents. Most of the articles in both journals which were used for the Slovene document collection were also accompanied by abstracts translated into English. Therefore, the process of building the English document collection was straightforward. The only problem was keyboarding, which required three weeks' work by the author. Because of the English automatic spell-checker, no particular additional checking of text was needed. The following is an example of a Slovene unit (referred to as SLOV) and its English equivalent (referred to as ENGL):

SLOV: Sodobni trendi v iskanju dokumentov * Popovič, M.: Knjižnica, 34(1990)1/2, str. 9-31
Trend ekstenzivne rasti bibliografskih in tekstovnih podatkovnih zbirk ter razvoj na področju hardverske in softverske tehnologije postavljata vedno bolj v ospredje sodobne, nekonvencionalne sisteme za iskanje informacij. Predstavniki teh sistemov so že dokazali, da ne omogočajo le večjega števila relevantnih zadetkov, temveč tudi samostojno iskanje uporabnikov po podatkovnih zbirkah. V članku so prikazane naslednje sodobne tehnike iskanja informacij, ki temeljijo na uporabi statističnih metod: avtomatsko indeksiranje, iskanje podatkov po načelu optimalnega primerjanja in ponderiranje gesel. Hkrati so opisane tudi nekatere jezikovno odvisne procedure, ki so bile doslej razvite tudi že za slovenski jezik (npr.
algoritem za avtomatsko zlivanje besed).

ENGL: Current trends in document retrieval * Popovič, M.: Knjižnica, 34(1990)1/2, pp. 9-31
The increasing use of bibliographic and text retrieval systems and developments in hardware and software technology will lead to a growing interest in advanced retrieval systems. These systems have already been shown to be able to retrieve larger amounts of relevant material than conventional systems and to replace the trained intermediary by end-users inexperienced in the search process. In this article, the following statistically-based methods of advanced information retrieval are described: automatic indexing, best-match searching, term weighting. Language-dependent procedures developed so far for Slovene are also briefly outlined (e.g., the automatic word conflation algorithm).

The implementation of the English version of the INSTRUCT package required the processing of document collections. The English document collection was first filtered by the application of the list of English stop-words, followed by Porter's stemming algorithm. After this complex processing was completed, the Slovene and English document collections exhibited similar characteristics in statistical terms. Table 8.1 shows some quantitative features of both collections, first after removal of stop-words, and secondly, after applying the stemming algorithms.

Quantitative characteristics                 SLOV    ENGL
Number of word types (stop-words deleted)   8,602   4,756
Number of stem types                        2,957   3,012

Table 8.1: Quantitative characteristics of the Slovene and English document collections

Table 8.1 points out the following main features of both document collections:

• The removal of stop-words preserves the difference in morphological complexity between Slovene and English; because of its morphological richness, the Slovene document collection contains a much larger number of original word types.

• The application of stemming algorithms reduces these differences to a minimum.
As a result, the Slovene document collection is indexed by 2,957 stems, compared to 3,012 stems for the English equivalent. It is important to note that the almost equivalent number of stems in both databases lends full credibility to Experiment II.

• A similar number of stems was achieved by a 65.5% level of compression of the Slovene dictionary file, and by a 36.7% level of compression of the English equivalent. Since larger vocabularies were used than those employed in Chapter 4, greater reductions in the size of the vocabularies were noted.

It is interesting to see that this experiment confirmed Porter's conclusion (1980) that the employment of his algorithm results in about a one-third reduction in the size of the vocabulary file. His vocabulary consisted of 10,000 words; the suffix stripping process resulted in 6,370 distinct entries, achieving a 36.3% level of compression.

A set of queries. In order to carry out Experiment II, an English translation of the Slovene queries was required. Table 8.2 illustrates the main comparative characteristics of the Slovene set of queries and the English equivalent after translation was complete.

Quantitative characteristics (before stopwording)   SLOV   ENGL
1. Number of queries                                  48     48
2. Total number of terms in a set of queries         370    399
3. Average number of terms per query                 7.7    8.3
4. Maximum number of terms per query                  19     21
5. Minimum number of terms per query                   2      2

Table 8.2: The main quantitative characteristics of the Slovene and English sets of queries

Table 8.2 indicates minor differences in the frequency distribution of terms between the Slovene set of queries and its English counterpart. The latter displays a slightly larger total number of words (i.e., 399 terms in the English set compared to 370 terms in the Slovene set). Consequently, a slight difference in the average number of terms per query was also noted (8.3 terms per query in the English set; 7.7 terms per query in the Slovene set).
These differences have been generated mainly by the following:

• the translation service;

• differences between Slovene and English grammar.

Although an attempt was made to translate the queries from Slovene to English with precise consistency, some of the queries required a broader accommodation of the English terminology. An example can be found in the Slovene phrase INDOK centre, which is usually translated as Information and documentation centre and very rarely as INDOC centre. The following is an example of a Slovene query and its translation into English:

QUERY 6:
SLOV: Specializirani INDOK centri v Sloveniji in na Hrvaškem.
ENGL: Specialized INDOC centres (information and documentation services) in Slovenia and Croatia.

The above example illustrates that the English query is characterized by a larger number of terms (11) than the Slovene query (8).

The second reason for quantitative differences between the English and Slovene query sets can be found in grammar. The English language is characterized by the frequent use of articles (i.e., A, AN, THE) which are not used in the Slovene language at all. This is demonstrated by the following query.

QUERY 23:
SLOV: Obvezni izvod in slovenska nacionalna bibliografija
ENGL: The deposit copy and the Slovene National Bibliography

The English query again has a greater number of terms (8) than its Slovene counterpart (6). Since terms such as A, AN, THE are included in the list of stop-words used by English IR systems, it is interesting to compare the quantitative characteristics of both sets of queries after removal of stop-words. This is illustrated in Table 8.3.

Quantitative characteristics (after stopwording)   SLOV   ENGL
1. Number of queries                                 48     48
2. Total number of terms in a set of queries        293    292
3. Average number of terms per query                6.1    6.1
4. Maximum number of terms per query                 15     13
5. Minimum number of terms per query                  2      2

Table 8.3: The main characteristics of the Slovene and English sets of queries after deletion of stop-words

The removal of stop-words from both sets of queries resulted in almost identical quantitative characteristics (i.e., the English set having in total 292 words, compared to 293 words in the Slovene set). This is, of course, an important argument for the correctness of the multi-lingual experiment. The validity of the project was further confirmed by the application of the stemming algorithms to the terms in the two sets of queries, as demonstrated in Table 8.4.

Quantitative characteristics                 SLOV   ENGL
Number of word types (stop-words deleted)    224    159
Number of stem types                         148    144

Table 8.4: Quantitative characteristics of the Slovene and English query sets before and after the employment of automatic stemming

Table 8.4 shows again—as also demonstrated in Table 8.1—that two completely different stemming algorithms (i.e., Slovene and Porter's) are able to produce almost identical numbers of stem types. This is, of course, of crucial importance for the statistical rigour of the experiment. A list of queries translated into English can be found in Appendix D, and may be compared with the list of Slovene queries as presented in Appendix B.

Relevance assessments.
Relevance judgments in Experiment II were carried out in the following manner:

• relevance assessments of a set of retrieved documents were made by the same group of users as in Experiment I;

• users were given the same instructions about carrying out relevance assessments as in Experiment I;

• users judged documents for their relevance on the basis of the information contained in the title and abstract of each retrieved document;

• the set of documents to be judged for relevance consisted of titles from best-match searching, using the ranked-output cutoff procedure; again, only the first ten retrieved documents (i.e., the cutoff point was 10) were used for further analysis;

• these documents were added to the pool developed in Experiment I;

• for the purpose of conducting a quantitative comparison, an additional pool was created containing documents retrieved by best-match search only—using automatic stemming—from both document collections; thus, both pools were a mixture of Slovene and English documents.

It has to be emphasized that all requesters were fluent in English. This means that their consistency in relevance judgments from Experiment I was more or less maintained.

Apart from judging retrieved English documents for their relevance, a parallel process was carried out as part of Experiment II. This method is known as a query expansion experiment, based on a string similarity measure. The main objective of this test was to retrieve and identify terms which are semantically related to a query term. A similar number of related retrieved terms from the Slovene and English dictionaries would indicate that this statistically-based method could play an important role within a multi-lingual IR environment.

English version of INSTRUCT

The English variation of the INSTRUCT package was created on the basis of the Slovene version.
This means that the new English variation of INSTRUCT required only language-dependent procedures (mainly deletion of stop-words and automatic conflation) to be modified. Again, the TURBO PASCAL 5.5 programming language was used.

8.4.2 The test procedures

Collection of data

To obtain the data required for the comparative evaluation, two main procedures were employed on the English document collection: best-match searching and identification of related stems. The best-match search required the application of the following steps:

• searching across a document collection, using the set of 48 English queries;

• retrieving a pool of document records that were then judged relevant or non-relevant by requesters;

• comparative evaluation of the results with the Slovene data from Experiment I.

A query expansion experiment was employed separately in the Slovene and English versions of the INSTRUCT package. Its application involved the following steps:

• searching in the dictionary component of the inverted file for stems which are similar to the selected query stems;

• retrieving, for each query stem, a pool of the 10 most similar stems (i.e., the stems having the largest number of trigrams in common with the selected query stem);

• assessment of the retrieved stems for their semantic relationship with the query stem;

• comparative evaluation of the resulting English and Slovene data.

Methods for the analysis of data

The employment of either best-match searching procedures or a query expansion module also affected the choice of methods for the analysis of data. Results provided by the best-match searches in the Slovene and English databases were analyzed by the following methods:

• a simple quantitative comparison;

• a measurement of the retrieval effectiveness;

• a failure analysis.
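The dictionary search of the query expansion steps above can be sketched as follows: rank every dictionary stem by the number of character trigrams it shares with the query stem and keep the top candidates. The exact trigram definition used by INSTRUCT (e.g., whether terms are padded at the boundaries) is not specified here, so the padding convention and the dictionary contents below are assumptions for illustration only.

```python
# Sketch of trigram-based term expansion: for a query stem, rank all
# dictionary stems by the number of shared character trigrams and
# return the k most similar. Padding convention and dictionary entries
# are hypothetical.

def trigrams(term, pad="*"):
    padded = pad + term + pad   # padding lets short stems form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def expand(query_stem, dictionary, k=10):
    ranked = sorted(
        dictionary,
        key=lambda s: len(trigrams(s) & trigrams(query_stem)),
        reverse=True,
    )
    return ranked[:k]

dictionary = ["bibliograf", "bibliotek", "katalog", "klasifikac", "biolog"]
print(expand("bibliografij", dictionary, k=3))
# ['bibliograf', 'bibliotek', 'biolog']
```

The retrieved stems are then judged by hand for a semantic relationship with the query stem, as described above; the trigram score itself is purely a string-similarity measure.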
The idea of using a simple quantitative comparison between the Slovene and English retrieval results (i.e., the number of identical hits) was very direct: to obtain preliminary results on either the differences or the similarities between the two systems. This means that items retrieved from the English database were compared to documents retrieved from the Slovene collection. It was expected that the frequency distribution of the identical hits (and, in particular, of identical relevant documents) would give indicators for using other methods for the comparative evaluation. This assumption was confirmed during the course of the experiment.

The second method employed in Experiment II consisted of the retrieval effectiveness measures, i.e., recall and precision. In addition, a statistical significance test—the sign test—was performed on the difference between the Slovene and English systems.

A third method which was also used in this experiment is known as failure analysis. Its main objective was to provide answers as to why certain relevant documents were not retrieved (Bawden, 1990). This type of analysis was shown to be extremely useful, in particular in explaining the differences between the Slovene and English retrieval performance.

The results obtained by the multi-lingual query expansion experiment were analyzed by the following two methods:

• a simple quantitative comparison;

• a failure analysis.

A simple quantitative comparison was employed to provide an answer on the similarities between English and Slovene. These similarities were calculated on the basis of retrieving a certain percentage of semantically related terms in each system. Since it was unrealistic to expect that both systems would provide identical numbers of related stems per term, a failure analysis was also employed.
8.5 Analysis of results

8.5.1 Multi-lingual experiment, using best-match searching facility

As stated above, the performances of the Slovene and English versions of INSTRUCT—from now on referred to as SLOV and ENGL, respectively—were first compared and analyzed within a nearest neighbour searching module. The results, analyzed by the employment of different methods, are described in the next sections.

A simple quantitative comparison

A pool of Slovene retrieved relevant documents was defined as a target for the comparative component in Experiment II. In other words, this simple analysis considered only those retrieved English documents which were identical to the Slovene documents. This means that non-identical English retrieved items were analyzed later within the context of measuring the retrieval effectiveness of both systems.

After the INSTRUCT package had been applied to the English document collection, the following quantitative similarity—expressed in terms of the number of identical documents—with the Slovene hits was determined. The processing of the 48 English queries resulted in 243 hits (50.6%) which were also found in the Slovene list. In other words, the English version of INSTRUCT retrieved 237 documents (49.4%) which were not found by the Slovene set of queries (since 480 documents were retrieved in all).

The relatively low similarity between the Slovene and English lists of retrieved documents was partly improved by comparing only the relevant documents from both listings. The English version of INSTRUCT was capable of retrieving 189 relevant documents which were also present in the Slovene list, consisting of 302 relevant items. The percentage of similarity was therefore 62.6%. On the other hand, only 54 identical non-relevant English documents were found in the Slovene listing, containing 178 non-relevant items (30.3%). Results of this simple comparison are summarized in Table 8.5.
SLOV: 480 hits            ENGL: 243 identical hits (50.6%)
SLOV: 302 relevants       ENGL: 189 identical relevants (62.6%)
SLOV: 178 non-relevants   ENGL: 54 identical non-relevants (30.3%)

Table 8.5: Percentage of the English retrieved documents which are identical to the documents in the Slovene list

Table 8.5 demonstrates that the identical documents comprise a much larger percentage of relevant documents (62.6%) than of non-relevant hits (30.3%). However, the proportion of retrieval failures (i.e., relevant documents retrieved only by the Slovene version of INSTRUCT, and not by the English one)—37.4%—begins to raise some initial doubts about using statistically-based IR techniques in multi-lingual information systems.

Table 8.6 additionally illustrates the level of similarity between the English and Slovene capability in retrieving identical relevant documents. It is evident that only 8 queries (16.7%) resulted in a list of entirely identical relevant documents. In most cases, the difference between the English and Slovene lists of relevant documents was either 1 document (10 queries, or 20.8%) or 2 documents (13 queries, or 27.1%). In total, 31 queries (64.6%) produced results where the difference between the Slovene and English listings was none, one or two relevant documents.

Number of different documents   Freq
0                                  8
1                                 10
2                                 13
3                                  3
4                                  6
5                                  3
6                                  3
7                                  1
8                                  1
Total                             48

Table 8.6: Frequency distribution of different relevant documents, retrieved by the English and Slovene versions of INSTRUCT

However, 17 queries (35.4%) still produced three or more relevant documents which were present in only one of the two listings. This percentage again raised the question of whether simple statistically-based techniques—as implemented in INSTRUCT—are applicable as a straightforward method in designing multi-lingual IR systems. In order to answer this question, more data on the performance of both systems was required.
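The simple quantitative comparison above reduces to expressing list overlaps as percentages of the Slovene list; a minimal sketch, reproducing the figures of Table 8.5, might look as follows (in a real run the inputs would be set intersections over document identifiers rather than pre-counted totals):

```python
# The simple quantitative comparison: identical documents expressed as
# a percentage of the corresponding Slovene list (figures of Table 8.5).

def pct_identical(n_identical, n_slov):
    """Identical documents as a percentage of the Slovene list."""
    return round(100 * n_identical / n_slov, 1)

print(pct_identical(243, 480))  # all hits          -> 50.6
print(pct_identical(189, 302))  # relevant hits     -> 62.6
print(pct_identical(54, 178))   # non-relevant hits -> 30.3
```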
Recall and precision as measures of retrieval effectiveness

The measurement of retrieval effectiveness was based on information from the pool of retrieved documents in Experiment I. The comparative analysis required this pool to be incremented with documents retrieved from the English database. This meant that—having applied a cutoff of 10—each query could potentially be represented by 40 distinct retrieved documents.

It has to be noted again that a relative recall was calculated, based on the substitute lists of total relevant items in the collection (i.e., a pool of retrieved documents). On the basis of a comparison of relevant items from the Slovene pool with the top 10 documents on the English list, the absolute figures were obtained for each query, and the recall values then calculated.

Table 8.7 clearly shows that applying the INSTRUCT package to searching documents in the Slovene collection produces better results than its employment in retrieving documents from the English database. In other words, whilst the Slovene version of INSTRUCT retrieved 302 relevant documents (achieving a recall of 69%), its English equivalent produced a list with only 248 relevant documents (achieving a recall of 56%). This means that retrieval of items from the English document collection resulted in a total of 54 retrieval failures.

The precision figures were also derived from Table 8.7, using the mean number of relevant documents per query out of the ten retrieved. Since the mean number of relevant documents in the Slovene pool was equal to 6.3 documents, the value of precision was 63%. On the other hand, the mean number of relevant hits in the English pool was equal to 5.2, resulting in a precision of 52%. The average recall and precision values of the two systems are summarized in Table 8.8.
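The recall and precision figures just described can be reproduced directly from the aggregate counts: relative recall is relevant retrieved over the pooled relevant set (439 documents in total), and precision is relevant retrieved over total retrieved (48 queries at a cutoff of 10). A minimal sketch:

```python
# Relative recall and precision at cutoff 10, from aggregate counts.
# Recall base: the pooled relevant set (439 documents); precision base:
# total retrieved (48 queries x 10 documents), as in the text.

def recall(rel_retrieved, pool_relevant):
    return round(100 * rel_retrieved / pool_relevant)

def precision(rel_retrieved, n_queries, cutoff=10):
    return round(100 * rel_retrieved / (n_queries * cutoff))

print(recall(302, 439), precision(302, 48))  # SLOV -> 69 63
print(recall(248, 439), precision(248, 48))  # ENGL -> 56 52
```

Note that this is a relative recall: the denominator is the pool of retrieved documents judged relevant, not the (unknown) total number of relevant documents in the collection.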
Although the English version of INSTRUCT retrieved some new relevant documents (its contribution to the total pool was 36 new relevant items, i.e., 8.2%), the results in Table 8.7 and in Table 8.8 indicate a quite large performance difference between the two systems. In other words, the Slovene version of INSTRUCT was much more successful in retrieving relevant documents. In order to find out whether this difference could be regarded as statistically significant, the sign test was carried out.

Query   SLOV rel   %    ENGL rel   %    Pooled rel   Distinct retrieved
  1        6     100       4      67         6              23
  2        9      90       8      80        10              18
  3        9      90       9      90        10              20
  4        7      87       7      87         8              17
  5        8      80       7      70        10              21
  6        8      89       7      78         9              21
  7       10      56       8      44        18              23
  8        3      50       5      83         6              21
  9        7      54       5      38        13              22
 10        7      70       5      50        10              21
 11        9      56       7      44        16              21
 12       10      63       5      31        16              24
 13        8      89       8      89         9              16
 14        9      75       9      75        12              17
 15        9      69       7      54        13              15
 16        5      83       4      67         6              21
 17        7     100       5      71         7              21
 18        7     100       4      57         7              18
 19        6      38       3      19        16              27
 20       10      50       6      30        20              25
 21       10      77       8      62        13              22
 22        6      67       6      67         9              12
 23        6      75       7      88         8              18
 24        3      75       3      75         4              28
 25        6      43       7      50        14              23
 26        7      78       6      67         9              20
 27        8      73       5      45        11              18
 28        7      54       8      62        13              19
 29        5      56       4      44         9              24
 30        1      50       2     100         2              20
 31        5      56       2      22         9              28
 32        3      60       3      60         5              19
 33        5     100       5     100         5              14
 34        4      67       2      33         6              20
 35       10      77       8      62        13              20
 36        9      90       4      40        10              20
 37        3      75       2      50         4              27
 38        4      57       5      71         7              23
 39        7      70       4      40        10              25
 40        5      71       3      43         7              23
 41        3      37       2      25         8              26
 42        2      67       2      67         3              23
 43        6     100       6     100         6              23
 44        5      62       5      62         8              21
 45        6     100       5      83         6              20
 46        2      50       2      50         4              26
 47        5      71       2      29         7              22
 48        5      71       7     100         7              23
Total    302      69     248      56       439           1,029

Table 8.7: Number of relevant retrieved documents and aggregate recall for the Slovene and English versions of INSTRUCT (cutoff 10)

Retrieval effectiveness   SLOV   ENGL
RECALL                      69     56
PRECISION                   63     52

Table 8.8: The average recall and precision values of the Slovene and English versions of INSTRUCT (cutoff 10)

The sign test. The existence of a significant performance difference was tested by the employment of the following hypotheses:

• H0: the application of the Slovene and English versions of INSTRUCT results in the retrieval of a similar percentage of relevant documents;

• H1: the employment of the Slovene package results in a larger number of relevant documents than use of the English system.

The significance level (α) was defined as 0.05, the number of couples (N) was equal to the number of queries, i.e., N = 48, and the rejection region was one-tailed. Table 8.9 shows that the Slovene package was more successful in 29 observations, and less successful than the English system in only 7 observations. The number of tied cases was 12.

Sign    Freq
+         29
−          7
0         12
Total     48

Table 8.9: Frequency distribution of the direction of differences between the Slovene and English versions of INSTRUCT

All tied cases were dropped from further analysis; consequently N was reduced to 36. Since N was still larger than 35, a method for large samples was applied (see Siegel and Castellan, 1988), resulting in a value of z = 3.5. Reference to Table A in Siegel and Castellan (1988) reveals that the probability of z ≥ 3.5 when H0 is true is 0.00023. Since 0.00023 is smaller than α = 0.05, the decision was made to reject the null hypothesis in favour of the alternative hypothesis. Thus, the sign test showed that there is a significant performance difference between the Slovene and English systems. Or, in other words, the Slovene version of INSTRUCT is able to produce significantly better results than its English counterpart. These results, i.e., the low number of identical retrieved documents and the significant performance difference between the two systems, were unexpected.
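The large-sample sign test computation just described can be sketched as follows: with ties dropped, N = 36 pairs remain, of which x = 29 favour the Slovene system; the normal approximation with continuity correction yields z, and the one-tailed tail probability is obtained from the normal distribution (replacing the table look-up in Siegel and Castellan).

```python
import math

# Large-sample sign test: z statistic with continuity correction and a
# one-tailed p-value from the standard normal distribution.

def sign_test_z(x, n):
    """z for x 'successes' out of n untied pairs under H0: p = 0.5."""
    return (x - 0.5 * n - 0.5) / (0.5 * math.sqrt(n))

def one_tailed_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

z = sign_test_z(29, 36)
print(round(z, 1))                # 3.5
print(round(one_tailed_p(z), 5))  # 0.00023
```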
As a result, the two main hypotheses of Experiment II were rejected by this experiment. In general, the experimental results gave the following indications:

• the more than satisfactory performance of the Slovene version of INSTRUCT indicates that statistically-based techniques can be introduced in Slovene IR systems without any hesitation;

• the significant performance difference between the Slovene and English versions of INSTRUCT has cast a serious shadow on the use of statistically-based techniques (as implemented in INSTRUCT) as a suitable tool in developing multi-lingual information systems.

This shadow was enlarged further after the so-called best-match pool was created, consisting only of Slovene and English documents retrieved by using automatic word conflation within a best-match context. All relevant documents from this pool were analyzed in order to find out their origin, i.e., whether they had been retrieved by the Slovene version of INSTRUCT or by its English equivalent. Table 8.10 illustrates the frequency distribution of relevant documents according to their source.

Source                                      Freq      %
Relevants retrieved by both SLOV and ENGL    189   52.4
Relevants retrieved only by SLOV             113   31.3
Relevants retrieved only by ENGL              59   16.3
Total                                        361  100.0

Table 8.10: Frequency distribution of retrieved relevant documents according to their source

Table 8.10 shows again that the Slovene system was superior to the English system. While the Slovene version of INSTRUCT retrieved 113 relevant documents (31.3%) not found by the English version, the latter reported 59 relevant items (16.3%) not retrieved by the Slovene system. In addition, the percentage of identical relevant documents retrieved by both systems was relatively low, i.e., 52.4%. In order to find reasons for the above differences between the English and Slovene systems, and on this basis to suggest further solutions, a failure analysis was carried out.
Failure analysis

The employment of the failure analysis resulted in the identification of several possible causes of the Slovene and English retrieval failures. The discussion in this section will focus on the following:

• the quality of the translation service;

• problems associated with the natural language processing of queries and documents;

• performance differences between the Slovene and English (i.e., Porter's) stemming algorithms.

Having obtained a large number of retrieval failures in the English version of INSTRUCT (i.e., 113 documents, or 31.3%), the first cause of failure which came to mind was a poor translation service. It is exactly this issue which was considered first. In addition, despite the many advantages of natural language query input and automatic indexing, there is one main problem which could also contribute to failures in retrieval. This problem is the question of how to cope with synonyms and other related terms without having, for example, a look-up thesaurus or other useful tools as a part of the system. Finally, automatic word conflation—as a language-dependent procedure—was also considered in detail as a part of the failure analysis. Since the two algorithms, i.e., the Slovene algorithm and Porter's algorithm, were developed independently of each other, and are based on different principles, they can also potentially seriously affect retrieval performance. For example, whilst Porter's algorithm correctly treats a certain term, the Slovene algorithm can produce an overstemming of the same term, and vice versa. Consequently, differences in stem weights can appear, leading to inconsistencies in the ranking of documents. To take into account the possible causes of retrieval failures—as enumerated above—the analysis was carried out as follows:

1. processing of all queries from the Slovene and English sets;

2.
deletion of stop-words and application of both stemming algorithms; on this basis a list of query stems was obtained, illustrating the following details: (a) the number of documents in which a particular stem occurred; (b) the weight calculated for this stem;

3. employment of best-match searching, resulting in two lists (Slovene and English) of the first ten retrieved documents for each query;

4. application of failure analysis to find the causes of retrieval failures;

5. extension of the ranked cutoff point from 10 to 20 in order to check whether some of the missing documents are within the first 20 retrieved items;

6. analysis of the frequency distribution of the causes of retrieval failures in order to find out which factor most affected the performance differences between the English and Slovene systems.

After the results of the detailed, time-consuming failure analysis had been obtained, it became possible to produce Table 8.11, which illustrates the contributions of these factors to retrieval failures.

  Cause of failure                         SLOV   ENGL
  Poor translation                           12     12
  "Uncontrolled" occurrence of synonyms      35     63
  Stemming algorithm                         12     38
  Total                                      59    113

Table 8.11: Frequency distribution of the main causes of retrieval failures in the English and Slovene systems

A detailed analysis of these main factors—the most important being the occurrence of synonyms and other related terms in the text—is given below.

Translation service. A poor translation was interpreted as the cause of retrieval failures for the following:

• incorrectly written words (e.g., SPLOŠNO IZOBRAŽEVALNE which should be written as a single word);

• improper translation (e.g.,
YUGOSLAVIA - NAŠA DRŽAVA; NATIONAL AND UNIVERSITY LIBRARY OF SLOVENIA - NARODNA IN UNIVERZITETNA KNJIŽNICA V LJUBLJANI).

  SLOV                       ENGL
  —                          INDOC
  naša država                Yugoslavia
  —                          UNISIST
  projekt                    —
  —                          Slovene
  Pionirska                  Pionirska
  splošno izobraževalne      public
  Ljubljana                  —
  —                          automatic
  Ljubljana                  Slovenia

Table 8.12: Examples of poor translation service

As shown in Table 8.11, the poor translation service resulted in 12 Slovene relevant documents (3.3%) not being retrieved; exactly the same number of retrieval failures was also reported for English documents. Table 8.12 illustrates some examples of bad translation ("—" indicates that the term was not translated at all). The effect of poor translation (i.e., untranslated terms, unsuitably translated words, etc.) is, of course, evident in the final ranking of retrieved documents. This can be illustrated with the employment of two queries:

QUERY 36: Karkoli o Narodni in univerzitetni knjižnici (NUK) v Ljubljani.
Anything about the National and University Library in Ljubljana.

The processing of this query (NUK is the abbreviation for the National and University Library) resulted in the following lists of stems (numbers in parentheses indicate document frequency, numbers in square brackets show the weight of a particular stem):

  SLOV                        ENGL
  NAROD (27) [3.8717]         NATION (76) [2.7284]
  UNIVER (85) [2.5952]        UNIVERS (118) [2.1832]
  KNJIŽ (240) [1.0953]        LIBRARI (247) [1.0397]
  LJUBLJAN (43) [3.3722]      LJUBLJANA (46) [3.2982]
  NUK (11) [4.8267]

Processing of the Slovene query resulted in 9 relevant documents, and its English equivalent retrieved only 4 relevant items. The following example shows why an English document became a retrieval failure.

2/113 Referalna literatura — instrument informatorja (Ob oblikovanju kataloga referalne literature v NUK) * Jakac-Bizjak, V.: Knjižnica, 29(1985)1, str. 74-79
Prispevek prinaša nekaj misli o referalni literaturi in o informacijski službi kot je Narodna in univerzitetna knjižnica v Ljubljani. ...
28/113 Reference literature — the instrument of the information librarian (The promotion of a new reference literature catalogue in the National and University Library of Slovenia) * Jakac-Bizjak, V.: Knjižnica, 29(1985)1, pp. 74-79
This contribution represents some new approaches to the reference literature and information service in libraries such as the National and University Library of Slovenia. ...

The above example illustrates the improper translation of the word LJUBLJANA to SLOVENIA, as emphasized by italics in the text. Since the term SLOVENIA did not appear in the query, the English document was ranked at the 28th position. In contrast, its Slovene equivalent achieved a high, 2nd place.

QUERY 23: Obvezni izvod in slovenska nacionalna bibliografija.
The deposit copy and the Slovene National Bibliography.

Again, the processing of this query produced the following lists of stems:

  SLOV                        ENGL
  OBVEZ (9) [5.0073]          DEPOSIT (7) [5.2627]
  IZVOD (3) [6.1180]          COPI (10) [4.9000]
  SLOVEN (74) [2.7597]        SLOVENE (41) [3.4242]
  NACIJ (43) [3.3722]         NATION (76) [2.7284]
  BIBLIOGRAF (49) [3.2285]    BIBLIOGRAPHI (26) [3.9115]

While the English version of INSTRUCT retrieved 7 relevant documents, its Slovene equivalent found 6 relevant items. In addition, the Slovene version reported 2 retrieval failures. The main cause of failure was again a poor translation service, this time in the Slovene texts. One of the Slovene retrieval failures is presented below.

8/94 Description of monograph publications in Slovene National Bibliography and National Bibliography of Yugoslavia * Ženi, J.: Knjižnica, 28(1984)1/2, pp. 16-34
Slovene National Bibliography and National Bibliography of Yugoslavia of 1975 have already been written according to ISBD(M), ...

18/94 Opis zaključenih publikacij v Slovenski bibliografiji in Bibliografiji Jugoslavije * Ženi, J.: Knjižnica, 28(1984)1/2, str. 16-34
Slovenska bibliografija in Bibliografija Jugoslavije sta v popisih za leto 1975 že prešli na ISBD(M). ...
The above example shows that the absence of the word NACIONALNA in the Slovene document caused this document to be ranked 18th. Since the word NATIONAL was present in the English document, its ranking was much higher, i.e., this document was among the first ten retrieved items. The second Slovene retrieval failure was again generated by the omission of the word NACIONALNA from the text. These examples demonstrate that the translation of documents from Slovene to English could be carried out by a better service. This comment is, in particular, a criticism of the journal Knjižnica, whose translation service should achieve significant improvements in the future.

Natural language processing of queries and documents. The main barrier to a more effective multi-lingual approach was caused—as is evident from Table 8.11—by natural language. A frequent occurrence of synonyms and related terms is a feature of natural language which cannot be controlled during the translation of texts. In addition, natural language processing in INSTRUCT is based on weighted single terms. In other words, as stated by Salton (1988), "...it is obviously not the case that the full text content is easily representable by single term sets". There is no doubt that synonyms and other related terms, whose appearance within INSTRUCT was not controlled, contributed the most to the significant performance difference between the Slovene and English systems.
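The weighted single-term matching just mentioned is the core of the best-match search used throughout these experiments: each query stem receives a collection-frequency weight, and documents are ranked by the sum of the weights of the query stems they contain. The sketch below is an illustration only; it uses the common inverse-document-frequency formula ln(N/n), whereas INSTRUCT's exact weighting scheme may differ, and the toy collection and stems are invented:

```python
import math

def idf(df: int, n_docs: int) -> float:
    # A common collection-frequency weight; INSTRUCT's exact
    # formula is not reproduced here and may differ.
    return math.log(n_docs / df)

def best_match(query_stems, docs):
    """Rank documents by summed weights of matching query stems.

    docs: mapping doc_id -> set of stems (after stop-word removal
    and stemming).  Returns doc ids, best match first.
    """
    n_docs = len(docs)
    df = {s: sum(s in stems for stems in docs.values()) for s in query_stems}
    weights = {s: idf(df[s], n_docs) for s in query_stems if df[s] > 0}
    scores = {doc_id: sum(w for s, w in weights.items() if s in stems)
              for doc_id, stems in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (contents invented for illustration).
docs = {
    "d1": {"KNJIŽ", "LJUBLJAN"},
    "d2": {"KNJIŽ"},
    "d3": {"NAROD", "UNIVER", "KNJIŽ", "LJUBLJAN"},
    "d4": {"AVTOMAT"},
}
ranking = best_match({"NAROD", "UNIVER", "KNJIŽ", "LJUBLJAN"}, docs)
print(ranking)  # ['d3', 'd1', 'd2', 'd4']
```

Note how rare stems (high weight) dominate the ranking, which is exactly why a mistranslated or unconflated rare term can push a relevant document far down the output list.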
The following list presents some examples of Slovene and English synonyms and related terms from both document collections:

SLOV:
  Velika Britanija — Great Britain, United Kingdom
  iskanje — searching, retrieval
  poizvedba — query, question
  javen — public, mass
  eksponat — object, item

ENGL:
  index — indeks, ključna beseda
  software — softver, program, programska oprema
  analysis — analiza, obdelava
  staff — delavci, osebje, kader
  training — usposabljanje, vzgoja
  nation — nacija, narod
  Yugoslavia — Jugoslavija, SFRJ
  data base — podatkovna zbirka, baza podatkov, datoteka
  user — uporabnik, bralec

These are only some of the examples which illustrate the problems that arise when it comes to the translation of a certain term. As shown in Table 8.11, the existence of synonyms and related terms caused 35 Slovene retrieval failures (9.7%). This percentage was even larger in the English system, which reported 63 retrieval failures (17.4%). The examples below illustrate the effect of natural language on retrieval performance.

QUERY 2: Knjižnice in informacijski centri v Veliki Britaniji in Indiji.
Libraries and information centres (centers) in Great Britain and India.

The processing of this query produced the following lists of stems:

  SLOV                      ENGL
  KNJIŽ (240) [1.0953]      LIBRARI (247) [1.0397]
  INFOR (261) [0.9285]      INFORM (261) [0.9285]
  CENT (101) [2.3838]       CENTR (32) [3.6912]
  VELIK (47) [3.2745]       CENTER (50) [3.2061]
  BRITAN (7) [5.2627]       GREAT (25) [3.9528]
  INDIJ (6) [5.4188]        BRITAIN (4) [5.8283]
                            INDIA (6) [5.4188]

While the Slovene version of INSTRUCT retrieved 9 relevant documents, the English processing of this query resulted in 8 relevant documents. One retrieval failure was caused by the occurrence of a related term. In other words, the phrase VELIKA BRITANIJA was translated in all documents to GREAT BRITAIN, apart from one item where the term UNITED KINGDOM was used.
1/154 EADI bibliotečno-dokumentacijsko-informacijska družina se je ponovno sestala * Potočnik-Kovše, T.: Knjižnica, 31(1987)1, str. 89-99
Prispevek je kratko poročilo o EADI seminarju v Brightonu (Velika Britanija), septembra 1986, ...

40/154 EADI library-documentation-information family has been afresh brought together * Potočnik-Kovše, T.: Knjižnica, 31(1987)1, pp. 89-99
The contribution is a brief report on an EADI meeting in Brighton (United Kingdom), September 1986, ...

Since UNITED KINGDOM was not included in the query, its occurrence in the document (as a substitute for GREAT BRITAIN) caused a significant decline of this document in the output list. A similar situation was reported also for one Slovene document, as illustrated below.

QUERY 41: Specializirane baze podatkov (podatkovne zbirke) v Jugoslaviji.
Specialized databases (data bases, collections) in Yugoslavia.

The following lists of stems were obtained:

  SLOV                      ENGL
  SPECIAL (42) [3.3979]     SPECIAL (101) [2.3838]
  PODAT (106) [2.3230]      DATA (106) [2.3230]
  BAZ (22) [4.0869]         BASE (65) [2.9101]
  ZBIR (51) [3.1841]        DATABAS (16) [4.4177]
  JUGOSL (66) [2.8926]      COLLECT (47) [3.2745]
                            YUGOSLAVIA (52) [3.1624]

Although the Slovene version of INSTRUCT retrieved more relevant documents than its English counterpart (3 vs 2), there was still 1 retrieval failure reported in the Slovene list. This document is analyzed below.

3/145 A short description of the development of the Slovene library information system with respect to the building industry * Verbič, D., Perc-Kovačič, C., Majcen-Čučnik, N.: Knjižnica, 30(1986)3/4, pp. 81-90
... General standpoints of the Yugoslav library information system derive from the basic principles of UNISIST. ...

21/145 Kratek opis razvoja knjižnično-informacijskega sistema v SR Sloveniji (KIS) s poudarkom na graditeljstvu * Verbič, D., Perc-Kovačič, C., Majcen-Čučnik, N.: Knjižnica, 30(1986)3/4, str. 81-90
...
Širša izhodišča KIS v SFRJ izhajajo iz osnovnih načel razvoja svetovnega sistema znanstvenih informacij — UNISIST. ...

The absence of the word SFRJ (a synonym for JUGOSLAVIJA) from the query caused the above Slovene document—ranked 3rd in the English list—to be dropped to 21st place. Since the ranked cutoff was defined at 10, this document was not presented in the pool of retrieved items; it was, consequently, considered a retrieval failure. The above examples have demonstrated that the main problem derives from the question of how to improve the content analysis of a multi-lingual text. There is no doubt that automatic indexing techniques (the assignment of weights to single index terms) cannot entirely represent the full text content in two languages. One of the best known techniques which can cope with synonyms and other related terms is known as controlled indexing. This technique provides a representation of a wide variety of related terms and descriptors by a single standard term or phrase. Such an indexing process is usually controlled by a thesaurus that contains classifications of similar words, and it is useful in broadening the indexing vocabulary by supplying synonyms and other related words. There is no doubt that the implementation of a thesaurus look-up process in INSTRUCT would improve its multi-lingual retrieval effectiveness. A multi-lingual thesaurus would contribute to the recognition of synonyms and other related words by replacing the original word stems with the corresponding thesaurus categories. Unfortunately, the problem is that thesauri constructed from particular document collections are not easily applied to new situations and new collections. Hence a given thesaurus often provides improvements that are valid only locally and under special circumstances (Salton, 1988). This is a very important point when considering possible techniques for improving a statistically-based approach in a multi-lingual IR environment.
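Such a look-up step could be grafted onto stem processing as a simple mapping from language-specific stems to shared concept identifiers. The sketch below is purely illustrative: the table entries and concept labels are invented, and INSTRUCT itself contained no such module.

```python
# Hypothetical multi-lingual thesaurus: language-specific stems are
# mapped to shared concept identifiers before weighting, so that
# synonyms and translations conflate to the same indexing unit.
THESAURUS = {
    "JUGOSL": "C_YUGOSLAVIA", "SFRJ": "C_YUGOSLAVIA",
    "YUGOSLAVIA": "C_YUGOSLAVIA",
    "BRITAN": "C_UK", "BRITAIN": "C_UK",
    "KINGDOM": "C_UK",  # from UNITED KINGDOM
}

def normalize(stems):
    """Replace each stem by its thesaurus category, if it has one."""
    return {THESAURUS.get(s, s) for s in stems}

# The SFRJ/JUGOSLAVIJA failure discussed above would then disappear:
query_stems = normalize({"SPECIAL", "BAZ", "PODAT", "JUGOSL"})
doc_stems = normalize({"KIS", "SFRJ", "UNISIST"})
print(query_stems & doc_stems)  # {'C_YUGOSLAVIA'}: the document now matches
```

The weakness noted above applies directly to this sketch: the mapping table is tied to one collection and subject area and would have to be rebuilt for another.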
Any refinements based on the application of a multi-lingual thesaurus are bound to specific document collections and subject areas.

Stemming algorithm. It is pointed out by Porter (1980) that the employment of suffix stripping usually results in a performance significantly less than 100%. The main task in stemming algorithm design—in Porter's words—is to keep the balance between the number of stemming rules and the efficiency of processing. The results in Table 8.11 are consistent with Porter's statement. It can be seen that his algorithm produced 38 retrieval failures. In other words, 10.5% of the missing relevant English documents were due to "errors" in suffix stripping. This percentage was much lower for the Slovene documents, where only 12 missing relevant documents (3.3%) were reported. It has to be noted immediately that the successful performance of the Slovene stemming algorithm probably derives from the fact that this algorithm (in particular, the list of suffixes) was produced using a text corpus from the librarianship and information science field. As noted earlier, the Slovene test collection in Experiment II covers exactly the same area. The employment of a document collection from any other field (e.g., engineering) would probably increase the number of retrieval failures for the Slovene stemming algorithm as well. However, it is interesting to note that—despite the presence of both types of stemming errors—Porter's algorithm has been particularly affected by understemming. A similar conclusion was also reported by Lennon et al. (1981). Some examples of both types of errors, as encountered in the queries and documents from the test collection in Experiment II, are presented below.

1.
understemming:

(a) SLOV: muzeol - muzej; avtomat - avtomatiz; baz - bazič; domoznan - domoznanst; softv - softver; strok - strokov;

(b) ENGL: Britain - British; Slovene - Slovenia; Yugoslav - Yugoslavia; scientif - scientist - scienc; librarianship - librari; analysi - analyt - analyz; museum - museologi; bibliograph - bibliographi; comput - computer; method - methodolog - methodologi;

2. overstemming:

(a) SLOV: REFER - referalni, referat; KNJIŽ - knjižnica, knjiga; CENT - centralni, center; GRAD - gradnja, gradivo;

(b) ENGL: PUBLIC - public, publications; RELAT - relations, related; IDENT - identity, identical; UNIVERS - university, universal; COMMUN - communication, community.

Both types of errors can seriously affect the multi-lingual retrieval of documents. While understemming results in many relevant documents not being retrieved at all, overstemming generates the retrieval of non-relevant items. Both errors are particularly evident in the assignment of weights to word stems. Consider the following example. The application of Porter's algorithm to the terms UNIVERSAL and UNIVERSITY resulted in the stem UNIVERS-. This stem occurred in 118 documents (its weight was equal to 2.185). On the other hand, the Slovene stemming algorithm conflated UNIVERZALNA to UNIVERZAL- (occurring in 12 documents, with weight 4.714) and UNIVERZA to UNIVER- (occurring in 85 documents, with weight 2.596). These differences, of course, produce different rankings of the Slovene and English documents, and—in the end—different retrieval performance. This is illustrated by the employment of two queries from the test collections. Since both queries are representative of very short statements, one would expect identical lists of Slovene and English retrieved documents: in fact, these two queries will demonstrate how the employment of two different stemming algorithms can seriously affect performance results.

QUERY 12: Splošnoizobraževalno knjižničarstvo v Ljubljani.
Public librarianship in Ljubljana.

The processing of this query resulted in the following lists of stems:

  SLOV                           ENGL
  SPLOŠNOIZOBRAŽ (36) [3.5649]   PUBLIC (132) [2.0361]
  KNJIŽ (240) [1.0953]           LIBRARIANSHIP (58) [3.0399]
  LJUBLJAN (43) [3.3722]         LJUBLJANA (46) [3.2989]

Whilst the processing of this query produced 10 relevant Slovene items, only 5 relevant English documents were retrieved. In addition, there was only 1 Slovene retrieval failure reported, compared with 6 failures in the English output. Using the above list of English stems it is fairly easy to explain the failures. Both types of stemming errors are present, as follows:

• overstemming: two non-related words, PUBLIC (e.g., PUBLIC LIBRARIES) and PUBLICATIONS, are conflated to the same stem PUBLIC-;

• understemming: LIBRARIANSHIP is not conflated to the stem LIBRARI-.

In contrast to Porter's algorithm, the Slovene algorithm was much more effective and did not report any errors for this query. The result is, of course, the retrieval of a larger number of relevant Slovene documents.

QUERY 48: Centralni katalog — avtomatizacija.
Union catalogue — automation.

The processing of this short query produced the following lists of stems:

  SLOV                     ENGL
  CENT (101) [2.3838]      UNION (10) [4.9000]
  KATAL (31) [3.7251]      CATALOGU (30) [3.7600]
  AVTOMAT (24) [3.9957]    AUTOM (18) [4.2958]

In this case, better performance results were obtained for the English system. While 7 relevant English documents were retrieved, the Slovene system reported the retrieval of only 5 relevant items. The less effective performance of the Slovene version of INSTRUCT was caused by overstemming, i.e., the stem CENT- also covered non-related words such as CENTER and CENTRALNI. The above two examples amply prove that the performance of the stemming algorithm can seriously affect multi-lingual output.

Cutoff 20. Experiments concerning the evaluation of IR systems by producing lists of ranked documents have to impose cutoff points.
Although the implementation of a ranked cutoff is usually helpful, an artificial distinction within a set of retrieved documents can also occur. In order to find out whether the cutoff factor significantly affects the performance difference between the Slovene and English systems, the ranked cutoff point was extended to 20. After the top 20 documents from both the Slovene and English lists of retrieved items had been analyzed, the following numbers of previously missing relevant documents were obtained:

• Slovene relevant documents: 32

• English relevant documents: 49

In other words, the total number of relevant documents retrieved by each system was now as follows (see Table 8.13):

  Retrieved documents    SLOV   ENGL
  Relevant documents      334    297

Table 8.13: Number of Slovene and English relevant documents at cutoff 20

Table 8.13 demonstrates that—despite the fact that a larger number of the missing English relevant documents were "hidden" within ranks 11-20—a performance difference between the Slovene and English systems still remains. To test whether there was a significant difference between the two systems, a sign test was applied. A value of 0.0031 was obtained; this value was within the region of rejection for α = 0.05. On this basis, the null hypothesis, stating that there is no difference between the Slovene and English systems, was rejected. In other words, despite the cutoff point being extended to 20, the Slovene version of INSTRUCT still performed significantly better. However, there is no doubt that these results provide additional insight into the application of statistically-based techniques for multi-lingual IR systems. With a large percentage of the English relevant documents ranking very close to their Slovene equivalents, the employment of iterative procedures in retrieval (e.g., a relevance feedback search) could significantly improve multi-lingual output.
However, in order to test this assumption, a much larger document collection—increasing the dispersion of retrieved relevant documents—would be required. A database containing 504 documents, as employed in Experiment II, is too small to provide reliable results about iterative searching in the multi-lingual context.

Recommendations using the results of the failure analysis. To summarize, the failure analysis carried out in Experiment II has demonstrated that simple statistically-based techniques, as implemented within the INSTRUCT package, are questionable as a straightforward method in multi-lingual IR systems. The failure analysis found the following two main "barriers":

• natural language (synonyms and other related terms);

• stemming algorithms.

However, the very successful performance results obtained by the Slovene version of INSTRUCT indicate that statistically-based techniques could still provide a framework for a multi-lingual IR system. A prerequisite is, of course, that they are enhanced with some other refinements (e.g., a thesaurus look-up process, iterative searching, a knowledge-based approach, etc.). It is in this area that additional experimental tests are needed. Finally—if nothing else—the above results have strongly confirmed that non-conventional, statistically-based techniques can be introduced in Slovene operational IR systems. This was additionally proved by the experiment described below.

8.5.2 A multi-lingual experiment based on the identification of word variants

Although the stemming procedure manages to conflate many morphological word variants, INSTRUCT offers further assistance to the end-user who wishes to identify terms that may be useful for the search, and that occur within the document file. This is achieved by the searcher selecting a stem of interest and then allowing the system to identify, in the dictionary component of the inverted file, those stems which are most similar to the chosen query stem.
The measure of similarity used is based upon the number of trigrams, i.e., strings of three characters, common to the query stem and each of the stems in the dictionary file. This may be illustrated by the words MIKROFILM and MIKROŽEPEK, which give rise to the stems MIKROFIL- and MIKROŽEP- respectively, and are characterized by the following two trigram lists ($ denotes the space character):

  $MI MIK IKR KRO ROF OFI FIL IL$
  $MI MIK IKR KRO ROŽ OŽE ŽEP EP$

The number of trigrams in common with the query stem (in the above example, there are four identical trigrams) is calculated for each of the stems in the database, and these numbers are sorted into descending order so as to identify the stems that are most similar to the chosen query. Thus, the submission of the stem MIKROFIL-, from MIKROFILM, results in the retrieval and display of the following 10 most similar stems:

  1 MIKROFILMAM (1)     2 MIKROFIS (1)
  3 MIKROPROF (1)       4 MIKRORAČUNAL (13)
  5 MIKROŽEP (1)        6 MIKROGRAF (4)
  7 MIKROOB (4)         8 MIKROPROCESOR (1)
  9 MIKROSNEM (1)      10 ŽIVIL (1)

Once the 10 most similar stems have been displayed, the user can select any of them from the list for inclusion in the query. This means of identifying word variants was first used on a substantial scale by Freund and Willett (1982), and an example of an operational retrieval system that uses the approach is described by Porter (1983). The advantage of the approach is that it permits not just the identification of morphological variants, but also other sorts of variants such as spelling errors, valid alternative spellings, and words with different prefixes. In Experiment II, a test was employed to find out whether this approach can be applied equally well to Slovene and English words. The following two sets of stems were processed:

• 148 Slovene query stems and 2,957 document stems from the Slovene dictionary file;

• 144 English query stems and 3,012 document stems from the English dictionary file.
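The trigram matching described above is straightforward to reproduce. The sketch below (a minimal illustration, not INSTRUCT's actual code) pads each stem with the $ space marker, extracts its trigrams, and ranks dictionary stems by the number of trigrams they share with the query stem:

```python
def trigrams(stem: str) -> set[str]:
    """Trigrams of a stem padded with '$' as the space character."""
    padded = f"${stem}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(query_stem: str, dict_stem: str) -> int:
    """Number of trigrams shared by the two stems."""
    return len(trigrams(query_stem) & trigrams(dict_stem))

def most_similar(query_stem, dictionary, cutoff=10):
    """The `cutoff` dictionary stems most similar to the query stem."""
    return sorted(dictionary,
                  key=lambda s: similarity(query_stem, s),
                  reverse=True)[:cutoff]

# The worked example from the text: MIKROFIL- and MIKROŽEP- share the
# four trigrams $MI, MIK, IKR and KRO.
print(similarity("MIKROFIL", "MIKROŽEP"))  # 4
```

Using sets makes each distinct trigram count once; for short stems this gives the same result as counting raw matches, as in the worked example above.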
Displayed index stems (at cutoff 10) were considered to be related to a query stem if they:

• had the same basic character structure as the query stem and were semantically related to it, e.g., COMPUT and MICROCOMPUT; SCIENTIF and SCIENTIST; DOSTOP and PRISTOP; MUZEJ and MUZEOL; or

• were unmistakable misspellings of the query term which could not have arisen from the misspelling of another word, e.g., LIBRARI and LIBARI; BIBLIOT and BIBLIT; RAČUNAL and RAČUNALKIK.

(In each case, the query stem has been italicized.) Similar criteria for the definition of a related term were employed in the experiment carried out by Freund and Willett (1982).

Analysis of results

After all stems from both query sets had been processed by the query expansion module, the results shown in Table 8.14 were obtained.

  Query set                SLOV    ENGL
  Stem types                148     144
  Related stems             223     240
  Related stems per term    1.5     1.7

Table 8.14: Number of retrieved related stems, using the Slovene and English query sets

Table 8.14 shows that the application of the query expansion module, based on a measure of trigram similarity, produced very similar results. Whilst searching in the English dictionary file resulted in 240 related terms (1.7 terms per stem), stems from the Slovene query set retrieved 223 related words in the Slovene dictionary file (1.5 per stem). It is important to note at this point that the large number of non-related terms in both the English and Slovene displays is due to the fact that this approach was based on the identification of related stems, and not word variants. Since both stemming algorithms had already managed to reduce the morphological variants of words in the dictionary files, the trigram similarity approach was trying to identify the following:

• stemming errors (in particular, errors caused by understemming);

• semantically related stems (in particular, stems with different prefixes);

• spelling errors.
According to the data in Table 8.15, the identification of stem variants was most effective in the area of retrieving semantically related stems which, in particular, differed in prefixes (e.g., COMMUN retrieved INTERCOMMUN, TELECOMMUN, COMMUNICOLOGI; GRAD retrieved DOGRAD, IZGRAD, NADGRAD, etc.). The set of English stems retrieved 173 (72.1%) semantically related terms; the processing of the Slovene set produced 174 (78.0%) semantically related stems. This demonstrates a high level of similarity between the Slovene and English results.

  Identification of:              SLOV    ENGL
  1. Stemming errors                35      66
  2. Semantically related stems    174     173
  3. Spelling errors                14       1

Table 8.15: Identification of related stems

The main difference between the two systems was in the identification of stemming errors. Applying the term expansion module to the stems in the English set of queries resulted in the retrieval of 66 terms (27.5%) from the dictionary file which were understemmed. In the Slovene list of retrieved related terms, the percentage of such stems was 15.7% (i.e., 35 stems). The larger percentage of retrieved understemmed terms in English is mainly due to the characteristics of Porter's algorithm, as emphasized in the previous section (e.g., the stems SCIENTIF-, SCIENTIST- and SCIENC- are not conflated to a single root). In this experiment, it was also interesting to note the effect of automatic spell-checking. Whilst 14 spelling errors (6.3%) were identified in the Slovene dictionary file, only 1 English spelling error occurred, which demonstrates the lack of automatic spell-checkers for Slovene. Despite some of the above performance differences, the query expansion experiment—using a trigram similarity measure—confirmed the initial assumptions. The similar numbers of semantically related stems that were obtained from the dictionary components of the English and Slovene inverted files indicate that this approach is applicable also to Slovene IR systems.
In addition, this type of query expansion could potentially help other statistically-based techniques to become more effective within a multi-lingual IR environment.

8.6 Conclusions

The detailed analysis of the results has confirmed only one of the three main hypotheses. This hypothesis (HYPOTHESIS 3) stated that the identification of stem variants—using a string similarity measure—will produce a similar number of related terms from both the English and Slovene dictionary components of the inverted files. The other two hypotheses (HYPOTHESIS 1 and HYPOTHESIS 2) were rejected, i.e.:

1. Processing of the English documents and queries within a best-match environment did not produce more or less identical hits to those retrieved from the Slovene database. The level of only 50% identical hits is sufficient reason to reject HYPOTHESIS 1.

2. There was a significant performance difference in retrieving relevant documents. The Slovene version of INSTRUCT produced significantly better results than its English equivalent.

The frequent occurrence of synonyms and other related terms in natural language, which are not controlled in INSTRUCT, and the automatic word conflation carried out by two different stemming algorithms were detected by the failure analysis as the main reasons for the performance differences between the Slovene and English systems. Thus, the results of Experiment II provided the following conclusions:

1. The successful performance of the Slovene version of the INSTRUCT package has confirmed that statistically-based techniques can be implemented in Slovene operational IR systems without any hesitation.

2. Simple statistical techniques—using single-term indexing with term weight assignments—are not appropriate as a straightforward method in the multi-lingual IR approach. However, Experiment II provided some indications that statistically-based IR techniques could still provide a broad framework for multi-lingual retrieval.
The prerequisite is that they are enhanced with other refinements.

The main problem to be solved in a multi-lingual information system is how to improve the content analysis of multi-lingual text. There is no doubt that automatic indexing techniques (the assignment of weights to single index terms) cannot entirely represent the full text content in two languages. Thus, the implementation of a thesaurus look-up process can, for example, potentially improve multi-lingual retrieval effectiveness. Some other retrieval techniques can also be considered as promising ways to improve multi-lingual output. Whilst some of them are also based on statistical principles (for example, iterative searching, i.e., relevance feedback searching, and query expansion modules), the others derive from the knowledge-based environment. However, the effect of these techniques on the improvement of a simple statistically-based approach in a multi-lingual IR system has not been tested within Experiment II. Thus, any recommendations for combining these techniques with single-term indexing, in order to improve multi-lingual output, remain speculative.

Chapter 9

Conclusions

9.1 Introduction

The primary aim of this project, which started in the academic year 1988/89, was to facilitate end-user access to bibliographic databases in Slovenia. At present, the information retrieval environment in Slovenia is characterized by the following:

• the growing number of bibliographic and other types of databases;

• increasing user demands for accurate and up-to-date information within a multi-lingual context;

• the application of different information retrieval systems.

It is important to note that all software systems (e.g., ATLASS, TRIP) available for accessing these databases are typical of current retrieval software elsewhere in that they are based on Boolean searching, with professional intermediaries being used to carry out on-line searches on behalf of end-users.
Furthermore, the effectiveness and efficiency of these systems have rarely been evaluated. Consequently, modern, non-conventional methods and techniques of information retrieval which allow direct end-user interaction with the system are neither incorporated into existing retrieval systems in Slovenia, nor has much research been carried out in this area.

This is, of course, a very questionable situation, because the provision of end-user searching facilities has been recognized in many countries as the only way to remove the barrier between the original source of a query and the query's answer. One of the main research areas in information retrieval whose main aim is to enable end-users to carry out searching in both an efficient and effective manner is based on the development of algorithmic procedures which allow the computer to undertake many of the functions of a trained intermediary. This approach, based on the use of a range of statistical techniques, is also known as statistically-based retrieval. Many such document retrieval systems have been described in the research literature and operational implementations of some of these ideas are now available (Willett, 1988a). To date, the great bulk of this work has been carried out with English language material, where the necessary linguistic facilities, i.e., stop-word lists and stemming routines, have been available for many years (Lovins, 1968).

Therefore, the main problem which was investigated in the context of this PhD project is contained in the following two questions:

1. Are statistically-based techniques applicable to Slovene information retrieval systems?

2. Could statistically-based techniques provide a framework for developing multi-lingual information retrieval systems?

The second point is of particular importance to end-users in Slovenia, who are surrounded by document collections written in different languages (i.e., Yugoslav languages and other major European languages).
9.2 Summary of results and conclusions

9.2.1 Development of a stop-word list and a stemming algorithm

The use of best-match searching techniques in a Slovene information retrieval environment was tested by the employment of INSTRUCT (INteractive System for Teaching Retrieval Using Computational Techniques). The processing routines in INSTRUCT are, in very large part, independent of the actual language in which the texts have been written. However, the implementation of a Slovene language-based information retrieval system required the development of the following two language-dependent components:

• the creation of a general purpose stop-word list;

• the design of a powerful stemming algorithm which takes account of the language's morphological structure.

The main feature of the Slovene language in the context of a system for automatic word conflation is that new word forms are created by adding derivational and inflectional suffixes to a basic stem. Thus, as with English, many distinctive words with similar meanings can be created from a single stem, and it should be possible to implement a conflation procedure for these morphological variants by procedures that utilize a set of suffixes. However, a detailed analysis of the morphological structure of the Slovene language revealed the following:

• the Slovene language exhibits an extremely rich inflectional morphology in both the verbal and nominal systems; for example, the word root RAZISKOVA (RESEARCH) can occur in any one of no less than 94 different forms;

• in addition, Slovene is characterized by various types of morphemic alternations, occurring in both stems and suffixes during inflection.

This implies that an effective stemming algorithm for Slovene text is likely to require many more suffixes and many more complex context-sensitive and recoding rules than is the case with English.
In addition, it is extremely difficult to establish iteration patterns; thus, the use of a longest-match algorithm was studied. However, the main aim of the design process was to obtain a reasonable balance between, on the one hand, the number of rules and, on the other hand, simplicity and efficiency of processing.

The starting point for the design of both a stop-word list and a stemming algorithm was data about the general frequency characteristics of Slovene, as extracted by an extensive study of two Slovene text corpora. The frequency of occurrence of the word types in these text corpora followed a typical Zipfian distribution, with very few word types providing a very high percentage of the observed tokens. Although these characteristics were analogous to those of English (and many other languages), it was not possible to identify words for inclusion in a stop-word list merely by taking account of the most frequently occurring words (as is commonly done in the case of stop-word lists for English databases). Such a procedure would ignore a very important difference between Slovene and English, this being the much greater number of distinct word types that are encountered in the former's natural language text.

Many of the large number of low-frequency Slovene words are morphological variants of very commonly occurring function words that certainly should be included in a stop-word list; equally certainly, these low-frequency word variants will not be included merely by selecting the most frequently occurring words. The production of a stop-word list for the Slovene language thus entailed a much greater level of detailed, manual involvement than is required for the construction of a stop-word list for the English language.
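The limitation of purely frequency-based selection can be sketched with a small example. The mini-corpus below is invented for illustration: an inflected function-word variant that occurs only once is never captured by taking the k most frequent word types, however the list is tuned.

```python
from collections import Counter

def top_frequency_stopwords(tokens, k):
    # Naive stop-word selection: simply the k most frequent word types,
    # as is commonly done for English databases.
    return {word for word, _ in Counter(tokens).most_common(k)}

# Invented mini-corpus: 'vase' (a rare inflected pronoun form) occurs
# only once, so frequency alone never selects it for the stop-word list.
corpus = "je in je da se v na je za se in da bi vase".split()
stops = top_frequency_stopwords(corpus, 5)
```

In a real Slovene corpus the same effect applies at scale, which is why the stop-word list described above required detailed manual inspection of the low-frequency word types as well.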
The resulting stop-word list contained a total of 1,593 non-content-bearing words, these consisting of function words such as prepositions, pronouns, auxiliary verbs, conjunctions, etc., together with a small core of other types of terms carrying extremely low meaning in phrases or sentences. It should be noted that this stop-word list can be described as the first general purpose stop-word list created for the Slovene language. The results of the evaluation, namely a level of compression comparable with that obtained when similar procedures were applied to English texts, and a successful level of indexing, demonstrated the potential applicability of this list to any information retrieval system in Slovenia.

The first step towards the design of an effective automatic conflation procedure for Slovene text was the development of a simple, context-free stemming algorithm, in which the most frequently occurring endings from the sorted list of reversed words were chosen as the suffixes. No recoding or context-sensitive rules were used, the only constraint on suffix removal being that the remaining stem should not contain fewer than three characters. This approach was clearly crude in concept, but avoided the need for the detailed manual processing that characterizes most other ways of creating lists of suffixes. However, the performance of this algorithm was far from satisfactory. The best overall results were obtained with the list containing 2,000 suffixes; even here, however, fewer than 40% of the words were conflated to the correct root. The poor level of performance meant that a more complex, context-sensitive algorithm needed to be developed. This algorithm was developed using the traditional, trial-and-error approach that characterizes most context-dependent algorithms.
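The context-free procedure just described can be sketched as follows. This is a simplified illustration: the suffix selection mirrors the reversed-word frequency approach, but the word lists and parameters are invented examples, not the 2,000-entry list derived from the corpora.

```python
from collections import Counter

def select_suffixes(words, max_len=4, top_k=2000):
    # Candidate suffixes: the most frequently occurring word endings,
    # up to max_len characters long.
    counts = Counter()
    for word in words:
        for n in range(1, min(max_len, len(word) - 1) + 1):
            counts[word[-n:]] += 1
    return {suffix for suffix, _ in counts.most_common(top_k)}

def context_free_stem(word, suffixes, min_stem=3):
    # Strip the longest matching suffix, provided at least min_stem
    # characters remain; no context-sensitive or recoding rules at all.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word
```

The single minimum-stem-length constraint is what keeps very short words, which would otherwise be reduced below three characters, unchanged.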
Consideration was given to the minimum stem length which should be left after the removal of a given ending, to new endings which needed to be added to the suffix list or endings which needed to be removed from it, and to the context-sensitive and recoding rules needed for accurate conflation. It was often the case that the selection of one suffix would require the adoption or removal of other suffixes, or the addition of context-sensitive rules in order to maintain consistency; this behaviour is, of course, characteristic of all languages and not specific to Slovene.

The resulting longest-match, context-sensitive algorithm is based on the use of 5,276 endings, each of which has an associated minimum stem length, either three or four characters, and one of eight action codes, which implement the context-sensitive rules. In addition, there are three types of recoding rule that are applied after suffix deletion.

The effectiveness of this algorithm was tested in two phases. First, the stemming algorithm was applied to a large text corpus which contained 2,616 distinct word types. In this context, the level of compression and the success rate of suffix stripping were measured. In the second phase, the stemming algorithm was implemented within a best-match information retrieval system, and its retrieval performance evaluated.

If the level of compression is expressed in terms of the number of reduced words, then 54.7% compression was achieved by the employment of this algorithm, demonstrating that it is a strong stemmer. However, this did not seem to affect performance adversely, since a detailed inspection of the resulting stems, carried out by a trained intermediary, revealed that the success rate of suffix stripping was 90.8%. The results of this simple test indicated that the procedures used in the stemming algorithm are workable and will yield good results with only minor changes.
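The overall structure of such a longest-match, context-sensitive procedure can be sketched as below. The rule tables shown are invented miniature examples for illustration only; the actual algorithm uses 5,276 endings, eight action codes, and three recoding rules.

```python
# Illustrative rule tables -- the endings, minimum stem lengths, action
# codes and recodings below are invented, not the thesis's actual rules.
ENDINGS = {
    "ega": (3, "always"),       # (minimum stem length, action code)
    "ama": (3, "always"),
    "a":   (3, "not_after_j"),  # context-sensitive: keep 'a' after a 'j'
}
ACTIONS = {
    "always":      lambda stem: True,
    "not_after_j": lambda stem: not stem.endswith("j"),
}
RECODINGS = [("ck", "k")]       # applied to the stem after suffix deletion

def stem_slovene(word):
    # Longest-match stemming: try endings from longest to shortest, check
    # the minimum stem length and the action code, then apply the
    # recoding rules to the resulting stem.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        min_stem, action = ENDINGS[ending]
        if word.endswith(ending):
            stem = word[:-len(ending)]
            if len(stem) >= min_stem and ACTIONS[action](stem):
                for old, new in RECODINGS:
                    if stem.endswith(old):
                        stem = stem[:-len(old)] + new
                return stem
    return word
```

The action codes are what distinguish this design from the context-free version: a matching ending may still be retained when the stem context forbids its removal.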
Although such alterations might involve the list of endings and, occasionally, the context-sensitive and recoding rules, the basic principles of the new stemming algorithm remain the same. However, in order to obtain final results about its retrieval performance, the second phase of the experiment was carried out.

9.2.2 Retrieval effectiveness of the stemming algorithm

The retrieval performance of the stemming algorithm was tested on the basis of its employment in a Slovene language version of the text retrieval system INSTRUCT. The Slovene version of INSTRUCT was designed by the conversion of the original (PRIME) version of INSTRUCT to an IBM PC-compatible microcomputer using the TURBO PASCAL 5.5 programming language. This means that the Slovene version of INSTRUCT consists of the following main modules:

• natural language query input (in Slovene);

• elimination of non-content-bearing terms from the query (using the dictionary of 1,593 Slovene stop-words);

• stemming of the remaining query terms (using the Slovene stemming algorithm);

• morphological term expansion using a string similarity measure;

• best-match searching (with the possibility of imposing Boolean constraints after the initial search has been carried out);

• relevance feedback searching;

• Boolean searching.

The major alterations to the original version of INSTRUCT were carried out in the language-dependent modules in order to achieve the main goal, i.e., the successful processing of Slovene terms both in queries and in documents.

The performance effectiveness of the stemming algorithm was tested by its comparison with two other types of text representation: manual right-hand truncation, carried out by a trained intermediary, and non-stemming.
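The query-processing chain of these modules can be sketched as a simplified best-match pipeline. This is an illustration only: the stop-word set, the stemmer, and the smoothed inverse-document-frequency weighting shown are stand-in assumptions, not INSTRUCT's actual components.

```python
import math
import re

def preprocess(text, stopwords, stem):
    # The language-dependent front end: tokenise, drop stop-words, stem.
    tokens = re.findall(r"\w+", text.lower())
    return [stem(t) for t in tokens if t not in stopwords]

def best_match(query, documents, stopwords, stem):
    # Rank documents by the sum of the weights of the query stems they
    # share with the query (a smoothed IDF weight is used here).
    doc_stems = [set(preprocess(d, stopwords, stem)) for d in documents]
    n = len(documents)

    def idf(term):
        df = sum(term in ds for ds in doc_stems)
        return math.log((n + 1) / (df + 1))

    query_stems = set(preprocess(query, stopwords, stem))
    scored = sorted(((sum(idf(t) for t in query_stems & ds), i)
                     for i, ds in enumerate(doc_stems)), reverse=True)
    return [i for _, i in scored]

# Toy usage with an invented prefix-truncation "stemmer":
stops = {"in", "v"}
stem = lambda t: t[:5]
docs = ["knjiznice v Ljubljani",
        "avtomatizacija katalogov",
        "knjiznica in katalog"]
ranking = best_match("knjiznice katalogi", docs, stops, stem)
```

The document matching both query stems is ranked first, which is the essential property of best-match searching as opposed to strict Boolean retrieval.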
Searches were carried out using these three types of text representation on a specially created Slovene document test collection (the first such collection to be created), which contained 504 documents in the library and information science field, a set of 48 queries, and relevance judgments of the retrieved documents.

The retrieval effectiveness of the three different types of search was tested by applying the well-known measures of recall and precision. In addition, the statistical significance of the differences was tested using the sign test and the Kendall coefficient of concordance, W. The results of the comparative evaluation of the three different types of search revealed the following:

• there is a significant performance difference between automatic word conflation and unstemmed processing of Slovene text;

• there is no significant performance difference between automatic stemming and manual right-hand truncation, carried out by a trained intermediary.

It follows that one of the important components of an information retrieval system, i.e., word conflation, can be automated in Slovene systems with no average loss of performance, thus allowing users easier access to the systems.

It is also interesting to note that the very large difference between stemmed and unstemmed searches was caused mainly by the richness of the morphology of Slovene. Similar searches on English databases (see, for example, Harman, 1991) suggest that automatic word conflation achieves only a slight improvement. This is an additional argument for the importance of an effective stemming algorithm within a Slovene information retrieval environment.

9.2.3 Multi-lingual approach to document retrieval

Having obtained good performance results with the employment of the Slovene stemming algorithm, the next experiment (Experiment II) was carried out. Its main objective was to test the performance of statistically-based techniques in two different languages, i.e., Slovene and English.
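The evaluation measures used in these comparisons can be sketched directly: set-based recall and precision for a single search, and a two-sided sign test over paired per-query differences. This is a simplified illustration of the standard formulations, not necessarily the exact computation used in the thesis.

```python
from math import comb

def recall_precision(retrieved, relevant):
    # Set-based recall and precision for one search.
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

def sign_test_p(differences):
    # Two-sided sign test over paired per-query differences. Ties (zeros)
    # are discarded; the p-value is the binomial probability, under a
    # fair coin, of a +/- split at least as extreme as the one observed.
    # Assumes at least one non-tied pair.
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    extreme = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(extreme + 1)) / 2 ** n
    return min(p, 1.0)
```

Averaging recall and precision over the full set of 48 queries, and applying the sign test to the per-query differences, is the pattern of comparison reported throughout this chapter.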
It was hoped that the results of this experiment could serve as an important contribution towards solving the problem of a multi-lingual approach to document retrieval in Slovenia. Experiment II sought to accomplish the following two main tasks: firstly, to test whether a Slovene information retrieval system, using statistically-based techniques, can achieve retrieval effectiveness analogous to that obtained by an English system; and, secondly, to examine whether statistically-based techniques are suitable for designing multi-lingual information systems. The same methodology as in the previous experiment was employed, i.e., a laboratory test was carried out, using the Slovene test collection and its English equivalent. Apart from measuring the retrieval performance of both systems, a detailed failure analysis was also employed. In addition, the Slovene and English systems were compared by their capability of producing semantically related stems for a selected query stem. As with Experiment I, this was the first such experiment carried out in Slovenia, and was also one of very few similar projects reported, so far, worldwide.

A detailed analysis of the performance results, expressed in terms of the number of identical hits and by measuring recall and precision, confirmed only one of the three main hypotheses. Although the experiment on the identification of stem variants produced a similar number of related terms from both the English and Slovene dictionary components of the inverted file, the other two hypotheses were rejected, i.e.:

• processing of the English documents and queries did not produce more or less identical hits to those retrieved from the Slovene database;

• the Slovene version of INSTRUCT produced significantly better performance results than its English equivalent.

These results were unexpected, and were investigated by the failure analysis.
This showed that the differences were due to the frequent occurrence of synonyms and other related terms, and to the behaviour of the two different stemming algorithms. This raises questions as to the suitability of statistically-based techniques for developing multi-lingual information systems, as implemented here. Better results might be obtained by enhancing the simple strategies tested here with some other refinements, such as a thesaurus look-up process, iterative searching, a knowledge-based approach, etc. However, the effect of these techniques on the improvement of a simple statistically-based approach in a multi-lingual retrieval environment has not been tested in Experiment II, mainly because of the small size of the test collection available to us.

9.3 Suggestions for further work

There is no doubt that this PhD project has clearly demonstrated the applicability of statistically-based retrieval techniques to a Slovene information retrieval environment. Apart from developing some other language-dependent techniques (e.g., a Slovene equivalent to the Soundex code), the following are some suggestions which could lead to the increased use of advanced information retrieval techniques in Slovenia:

• INSTRUCT as a teaching resource in information retrieval courses at the Department of Librarianship, University of Ljubljana;

• INSTRUCT as a demonstration package installed at the Computing Centre, University of Maribor, which acts as a host for a large number of databases;

• incorporation of some INSTRUCT modules into existing Slovene retrieval packages, particularly those still under development.

However, it should be noted that these advanced techniques of information retrieval will be firmly established in Slovenia only if they can be enhanced with refinements which allow a multi-lingual approach to document retrieval.
There are only two million people living in Slovenia, and end-users are faced with document collections written not only in Slovene, but also in other Yugoslav languages or in other major European languages. One of the main requirements of end-users in Slovenia will be the ability to input a query and to receive a query answer in the Slovene language from any of these databases. It is clear from the results of this thesis that the provision of such facilities will require a very large amount of research.

Appendix A

A list of consulted literature

Bidwell, C.A. (1969) Outline of Slovenian Morphology. Pittsburgh: University of Pittsburgh.

Hamp, E.P. (1975) On the dual inflections in Slovene. Slavistična revija, Vol. 23, pp. 67-70.

Lencek, R.L. (1966) The Verb Pattern of the Contemporary Standard Slovene. Wiesbaden: Otto Harrassowitz.

Lencek, R.L. (1982) The Structure and History of the Slovene Language. Columbia: Slavica.

Paternost, J. (1963) The Slovenian Verbal System: Morphophonemics and Variations. PhD Thesis, Indiana University.

Rigler, J. (1966) Premene tonemov v oblikoslovnih vzorcih slovenskega knjižnega jezika. Jezik in slovstvo, Vol. 10, pp. 24-35.

Tollefson, J.W. (1981) The Language Situation and Language Policy in Slovenia. Washington: University Press of America.

Toporišič, J. (1966) Esej o slovenskih besednih vrstah. Jezik in slovstvo, Vol. 10, pp. 295-305.

Toporišič, J. (1967) Strukturiranost slovenskih glasov in predvidljivost njihove razvrstitve. Jezik in slovstvo, Vol. 11, pp. 92-96.

Toporišič, J. (1975) Main characteristics of the Slovene language. In: Komac, D. and Skerlj, R. English-Slovene and Slovene-English Dictionary. Ljubljana: Cankarjeva založba, pp. 417-435.

Toporišič, J. (1978) A language of a small nationality in a multilingual state. Folia Slavica, Vol. 1, pp. 480-487.

Toporišič, J. (1984) Slovenska slovnica. 2nd ed., Maribor: Obzorja.

Vidovič-Muha, A. (1988) Slovensko skladenjsko besedotvorje ob primerih zloženk.
Ljubljana: Partizanska knjiga.

Appendix B

The list of natural language queries

The list of queries consists of 48 queries as provided by 8 users.

1. Marketing (trženje) v knjižnicah, identiteta knjižnice in stiki z javnostjo.
2. Knjižnice in informacijski centri v Veliki Britaniji in Indiji.
3. Selektivna diseminacija informacij (SDI) in retrospektivne poizvedbe (RP).
4. Klasificiranje (UDK - univerzalna decimalna klasifikacija), klasifikacijske sheme in klasifikacijski sistemi.
5. Znanstveno-raziskovalno delo in razstavna dejavnost Referalnega centra Univerze v Zagrebu.
6. Specializirani INDOK centri v Sloveniji in na Hrvaškem.
7. Informacijsko-dokumentacijski centri (INDOK) v Jugoslaviji.
8. Specializirani INDOK centri (SIC) v Sloveniji in Jugoslaviji.
9. Računalniški programi (programska oprema, programski paketi in sistemi, softver) na področju knjižničarstva in dokumentalistike.
10. Indeksiranje (dokumentiranje) in sekundarni dokumenti (sekundarne publikacije) v knjižnično-informacijskem sistemu (KIS) in sistemu znanstvenih in tehničnih informacij (SZTI).
11. Mednarodni standardni bibliografski opis (ISBD) in UNISIST v Sloveniji in Jugoslaviji.
12. Splošnoizobraževalno knjižničarstvo v Ljubljani.
13. Zaščita in izboljšanje človekovega okolja - pomen informacij in dokumentov ter delovanje INDOK centra v Zagrebu.
14. Pomen bibliometrije in analize citatov pri evalvaciji kvalitete znanstvenih del in časopisov.
15. Muzeologija, muzejska dejavnost in muzejski eksponati.
16. Raziskovalna dejavnost na področju bibliotekarstva in informatike v Sloveniji ter razvoj bibliotekarske vede (stroke).
17. Modeli za določanje zanesljivosti softvera.
18. Teorija in modeli samoupravnega javnega komuniciranja.
19. Vsebinska (predmetna) obdelava in avtomatizacija.
20. Problematika iskanja informacij (poizvedbe).
21. Standardi v knjižničarstvu in knjižnicah.
22. Mikroračunalniki v knjižnicah in INDOK centrih.
23. Obvezni izvod in slovenska nacionalna bibliografija.
24.
Domoznanske zbirke.
25. Izobraževanje (šolanje, vzgoja) knjižničarskih kadrov (bibliotekarjev, bibliotekarskih kadrov) - učni načrti in strokovni izpiti.
26. Projektiranje (planiranje, načrtovanje) bibliotečnih (knjižničnih) stavb - gradnja (izgradnja), prostori, notranja oprema in ureditev.
27. Zakonodaja in zakoni o knjižničarstvu v Sloveniji.
28. Delo z bralci, knjižna vzgoja, branje v šolskih, pionirskih in mladinskih knjižnicah.
29. Povezovanje (sodelovanje) šolskih (pionirskih, mladinskih) knjižnic s splošnoizobraževalnimi knjižnicami (SIK-i).
30. Časniki (časopisi, serijske oz. periodične publikacije) in mikrofilmanje.
31. Informacijska služba v knjižnicah.
32. Podatkovne zbirke (baze podatkov) v družboslovnih in humanističnih vedah.
33. Medbibliotečna (medknjižnična) izposoja.
34. Bralci in uporabniki v splošnoizobraževalnih knjižnicah.
35. Citatna analiza (analiza citatov).
36. Karkoli o Narodni in univerzitetni knjižnici v Ljubljani.
37. Računalniška obdelava (avtomatizacija) nacionalnih bibliografij.
38. Metode poizvedovanja (iskanja) v bibliografskih bazah podatkov (podatkovnih zbirkah).
39. Analiza uporabnikov v knjižnicah.
40. Pomen mikrofilma in mikrofilmanja v knjižnicah in INDOK centrih.
41. Specializirane baze podatkov (podatkovne zbirke) v Jugoslaviji.
42. Avtomatizacija poslovanja nacionalnih knjižnic.
43. Knjižnice - koordinacija - nabavna politika (nakupi).
44. Publikacije in informacije - univerzalna (splošna) dostopnost.
45. Bibliografije - katalogi - normativna kontrola - normativne datoteke.
46. Univerzitetne (univerzne, visokošolske) knjižnice - standardi.
47. Predmetni katalog - indeksiranje - tezaver.
48. Centralni katalog - avtomatizacija.

Appendix C

The list of queries as processed by the trained intermediary

The order of this list corresponds to the list in Appendix B. The question mark '?' indicates the point at which the trained intermediary removed an ending from the word.

1. Market? (trg? trž?) knjižni? identi? knjižni?
stik? javnost?
2. Knjižni? informac? cent? Velik? Britan? Indij?
3. Selektiv? disemin? informacij? (SDI) retrospekt? poizved? (RP)
4. Klasifi? (UDK, univerzaln? decimaln? klasif?) klasif? shem? klasif? sistem?
5. Znanstven? raziskoval? del? razstav? dejavn? Referaln? cent? Univerz? Zagreb?
6. Special? INDOK cent? Sloven? Hrvašk? (Hrvat?)
7. Informac? dokument? cent? (INDOK) Jugosl?
8. Special? INDOK cent? (SIC) Sloven? Jugosl?
9. Računaln? program? (program? oprem? program? paket? sistem? softver?) knjižni? dokument?
10. Indeks? (dokument?) sekund? dokument? (publik?) knjižn? informac? sistem? (KIS) sistem? znanstv? tehn? informacij? (SZTI)
11. Mednarodn? standard? bibliografsk? opis? (ISBD) UNISIST Slovenij? Jugosl?
12. Splošnoizobraževaln? knjižničar? Ljubljan?
13. Zaščit? izboljš? človek? okol? pomen inform? dokument? delov? INDOK cent? Zagreb?
14. Pomen bibliomet? anali? citat? evalv? kvalit? znanstv? del časopis?
15. Muzeol? muzej? dejavnost? muzej? eksponat?
16. Raziskov? dejavnost? področ? bibliotekar? informatik? Slovenij? razvoj bibliotekars? ved? (strok?)
17. Model? določ? zanesljivost? softver?
18. Teor? model? samoupravn? javn? komunic?
19. Vsebin? (predmet?) obdel? avtomat?
20. Problem? iskan? inform? (poizved?)
21. Standard? knjižni?
22. Mikroračun? knjižni? INDOK cent?
23. Obvezn? izvod? slovensk? nacional? bibliograf?
24. Domoznan? zbirk?
25. Izobra? (šolan? vzgoj?) knjižni? kad? (bibliotekar? kad?) uč? načrt? strokov? izpit?
26. Projekt? (plan? načrt?) bibliot? (knjižni?) stavb? grad? (izgrad?) prosto? notran? oprem? uredit?
27. Zakon? knjižni? Sloven?
28. Delo? bral? knjižn? vzgoj? bran? šolsk? pionir? mladin? knjižni?
29. Povez? (sodel?) šolsk? (pionir? mladin?) knjižni? splošnoizobraževal? knjižni? (SIK)
30. Časnik? (časopis? serij? period? publik?) mikrofilm?
31. Inform? služb? knjižni?
32. Podatk? zbir? (baz? podatk?) družboslov? humani? ved? (znan?)
33. Medbibliotečn? (medknjižničn?) izposoj?
34. Bral? uporabn? splošnoizobraževaln?
knjižni?
35. Citat? anali?
36. Narodn? univerz? knjižnic? Ljubljan? (NUK)
37. Računaln? obdel? (avtomat?) nacion? (narod?) bibliograf?
38. Metod? poizved? (iskan?) bibliograf? baz? (podatk? zbirk?)
39. Anali? uporabn? knjižn?
40. Pomen mikrofilm? knjižni? INDOK centr?
41. Special? baz? podatk? (podatk? zbir?) Jugosl?
42. Avtomat? poslov? nacionaln? knjižni?
43. Knjižni? - koordin? nabav? politik? (nakup?)
44. Publi? inform? - univerzaln? (splošn?) dostop?
45. Bibliograf? - katalog? - normat? kontrol? - normat? datotek?
46. Univerzitet? (univerz? visokošol?) knjižni? - standard?
47. Predmet? katalog? - indeks? - tezav?
48. Centraln? katalog? - avtomat?

Appendix D

The list of English language queries

This list corresponds to the list of queries in the Slovene language, as presented in Appendix B.

1. Marketing in libraries, library's identity and public relations.
2. Libraries and information centres (centers) in Great Britain and India.
3. Selective dissemination of information (SDI) and retrospective searches.
4. Classification (UDC - Universal Decimal Classification), classification schemes, classification systems.
5. Research work and exhibition activities of the Referral Centre of the University of Zagreb.
6. Specialized INDOC centres (centers; information and documentation services) in Slovenia and Croatia.
7. Information and documentation services (INDOC centres; centers) in Yugoslavia.
8. Specialized INDOC centres (centers; information and documentation services) in Slovenia and Yugoslavia.
9. Software (software packages and systems, program packages) in librarianship and documentation.
10. Indexing (documentation) and secondary documents (secondary publications) in library and information services and in scientific and technical information services.
11. International Standard Bibliographic Description (ISBD) and UNISIST in Slovenia and Yugoslavia.
12. Public librarianship in Ljubljana.
13.
Protection and improvement of the human environment — the importance of information and documents and activity of the INDOC Centre in Zagreb.
14. Bibliometrics and citation analysis in evaluation of scientific papers and periodicals.
15. Museology, museum activities, and museum exhibits.
16. Research work in the field of librarianship and information science in Slovenia and development of library science (profession).
17. Software reliability models.
18. Theory and models of self-management mass communication.
19. Subject (content) analysis and automation.
20. Information retrieval and searching.
21. Standards in libraries and librarianship.
22. Microcomputers in libraries and INDOC centres (centers; information and documentation services).
23. The deposit copy and the Slovene National Bibliography.
24. Local (ethnographic, demographic) collections.
25. Training and education of librarians (library staff, personnel) — educational programmes (curriculums) and examination regulations.
26. Planning and design of library buildings — construction, space requirements, interior equipment and layout.
27. Legislation and legal acts to do with librarianship and libraries in Slovenia.
28. Book and literary education, readers and reading in school, pioneers' and youth libraries.
29. Cooperation (co-operation) between school (pioneers', youth) libraries and public libraries.
30. Newspapers (periodicals, journals, serial publications) and microfilming.
31. Information services in libraries.
32. Data bases (databases) in social sciences and humanities.
33. Interlibrary lending (inter-library loan).
34. Readers and users in public libraries.
35. Citation analysis.
36. Anything about the National and University Library in Ljubljana.
37. Computer-based (automated) processing of national bibliographies.
38. Information retrieval methods for searching in bibliographic databases (data bases, collections).
39. User studies (surveys) in libraries.
40.
Microfilm and microfilming in libraries and INDOC centres (centers; information and documentation services).
41. Specialized databases (data bases, collections) in Yugoslavia.
42. National libraries and automation.
43. Libraries — co-ordination (coordination) — acquisition policy.
44. Publications and information — universal (general) availability.
45. Bibliography — catalogues — authority control — authority files.
46. University libraries — standards.
47. Subject catalogue — indexing — thesaurus.
48. Union catalogue — automation.

REFERENCES

Al-Hawamdeh, S. and Willett, P. (1989) Paragraph-based nearest neighbour searching in full-text documents. Electronic Publishing, Vol. 2, pp. 179-192.

Angell, R.C. et al. (1983) Automatic spelling correction using a trigram similarity measure. Information Processing and Management, Vol. 19, pp. 255-261.

Ashford, J. and Willett, P. (1988) Text Retrieval and Document Databases. Bromley: Chartwell-Bratt.

Bar-Hillel, Y. (1962) Theoretical aspects of the mechanization of literature searching. In: W. Hoffman (ed). Digitale Informationswandler. Braunschweig: Vieweg and Sons, pp. 406-443.

Bawden, D. (1986) Information systems and the stimulation of creativity. Journal of Information Science, Vol. 12, pp. 203-216.

Bawden, D. (1990) User-oriented Evaluation of Information Systems and Services. London: Gower.

Bell, C.L. and Jones, K.P. (1976) A minicomputer retrieval system with automatic root finding and roling facilities. Program, Vol. 10, pp. 14-27.

Bidwell, C.A. (1969) Outline of Slovenian Morphology. Pittsburgh: University of Pittsburgh.

Biru, T. et al. (1989) Inclusion of relevance information in the term discrimination model. Journal of Documentation, Vol. 45, pp. 85-109.

Booth, A.D. (1967) A "law" of occurrences for words of low frequency. Information and Control, Vol. 10, pp. 386-393.

Brzozowski, J.P. (1983) MASQUERADE: searching the full text of abstracts using automatic indexing. Journal of Information Science, Vol.
6, pp. 67-73.
Carroll, D.M. et al. (1988) Bibliographic pattern matching using the ICL Distributed Array Processor. Journal of the American Society for Information Science, Vol. 39, pp. 390-399.
Cercone, N. (1978) Morphological analysis and lexicon design for natural-language processing. Computers and the Humanities, Vol. 11, pp. 235-258.
Chiaramella, Y. and Defude, B. (1987) A prototype of an intelligent system for information retrieval: IOTA. Information Processing and Management, Vol. 23, pp. 285-303.
Cleverdon, C.W. (1966) Factors Determining the Performance of Indexing Systems. Cranfield: College of Aeronautics.
Cleverdon, C.W. (1984) Optimizing convenient online access to bibliographic databases. Information Services and Use, Vol. 4, pp. 37-47.
Cooper, D. and Lynch, M.F. (1979) Compression of Wiswesser line notations using variety generation. Journal of Chemical Information and Computer Sciences, Vol. 19, pp. 165-169.
Croft, W.B. and Harper, D.J. (1979) Using probabilistic models of document retrieval without relevance information. Journal of Documentation, Vol. 35, pp. 285-295.
Cuadra, C.A. and Katter, R.V. (1967) Opening the black box of "relevance". Journal of Documentation, Vol. 23, pp. 291-303.
Dawson, J.L. (1974) Suffix removal and word conflation. ALLC Bulletin, Vol. 2, pp. 33-46.
Dimec, J. (1988) Računalniška analiza slovenskega jezika v medicini (A Computer Analysis of Slovene Language in Medicine). M.Sc. thesis, University of Ljubljana.
Dolby, J.L. and Resnikoff, H.L. (1964) On the structure of written English. Language, Vol. 40, pp. 167-196.
Doszkocs, T.E. (1983) CITE NLM: natural language searching in an online catalog. Information Technology and Libraries, Vol. 2, pp. 364-380.
Dubois, C.P.R. (1979) Multilingual information systems: some criteria for the choice of specific techniques. Journal of Information Science, Vol. 1, pp. 5-12.
Ellis, D. (1987) The Derivation of a Behavioural Model for Information Retrieval System Design.
Ph.D. thesis, University of Sheffield.
Ellis, D. (1990) New Horizons in Information Retrieval. London: The Library Association.
Fagan, J.L. (1989) The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, Vol. 40, pp. 115-132.
Field, B.J. (1975) Semi-automatic Development of Thesauri Using Free-language Vocabulary Analysis (Part 1 only). Report no. R75/24: INSPEC.
Field, B.J. (1977) Automatic indexing for multilingual systems. Third European Congress on Information Systems and Networks: Overcoming the Language Barrier. London: Saur, pp. 469-492.
Frakes, W.B. (1984) Term conflation for information retrieval. In: C.J. van Rijsbergen (ed.) Research and Development in Information Retrieval. Cambridge: CUP, pp. 383-390.
Freund, G.E. and Willett, P. (1982) Online identification of word variants and arbitrary truncation searching using a string similarity measure. Information Technology: Research and Development, Vol. 1, pp. 177-187.
Fuhr, N. (1990) Zur Überwindung der Diskrepanz zwischen Retrievalforschung und -praxis (On overcoming the discrepancy between retrieval research and practice). Nachrichten für Dokumentation, Vol. 41, pp. 3-7.
Goldsmith, N. (1982) An appraisal of factors affecting the performance of text retrieval systems. Information Technology: Research and Development, Vol. 1, pp. 41-53.
Griffiths, A. et al. (1984) Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, Vol. 40, pp. 175-205.
Griffiths, A. et al. (1986) Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, Vol. 37, pp. 3-11.
Hafer, M.A. and Weiss, S.F. (1974) Word segmentation by letter successor varieties. Information Storage and Retrieval, Vol. 10, pp. 371-385.
Harman, D. (1987) A failure analysis on the limitations of suffixing in an online environment.
Proceedings of the Tenth International Conference on Research and Development in Information Retrieval. Washington: ACM, pp. 102-108.
Harman, D. (1991) How effective is suffixing? Journal of the American Society for Information Science, Vol. 42, pp. 7-15.
Harter, S.P. (1975) A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Journal of the American Society for Information Science, Vol. 26, pp. 280-289.
Hendry, I.G. et al. (1986a) INSTRUCT: a teaching package for experimental methods in information retrieval. Part 1. The users' view. Program, Vol. 20, pp. 245-263.
Hendry, I.G. et al. (1986b) INSTRUCT: a teaching package for experimental methods in information retrieval. Part 2. Computational aspects. Program, Vol. 20, pp. 129-151.
Hildreth, C.R. (1982) Online browsing support capabilities. Proceedings of the ASIS Annual Meeting 19. White Plains, New York: Knowledge Industry Publications Inc., pp. 127-132.
Institute for Information Science, University of Maribor (1990) ATLASS in sistem vzajemne katalogizacije (ATLASS and the shared cataloguing system). Maribor: University of Maribor.
Jäppinen, H. et al. (1985) FINNTEXT — text retrieval system for an agglutinative language. RIAO 85 Recherche d'Informations, Grenoble, pp. 217-226.
Jones, K.P. and Bell, C.L.M. (1984) The automatic extraction of words from text, especially for input into information retrieval systems based on inverted files. In: C.J. van Rijsbergen (ed.) Research and Development in Information Retrieval. Cambridge: CUP, pp. 409-419.
Keen, E.M. (1991a) The use of term position devices in ranked output experiments. Journal of Documentation, Vol. 47, pp. 1-22.
Keen, E.M. (1991b) The effect of stemming strength on the effectiveness of output ranking. Paper given at Informatics 11, March 1991, 13 pp.
Kimberley, R. (ed.) (1987) Text Retrieval: A Directory of Software. 2nd edition. Aldershot: Gower.
Kosmač, C. (1953) Pomladni dan (A Day in Spring). Ljubljana: Državna založba.
Lancaster, F.W.
(1969) MEDLARS: report on the evaluation of its operating efficiency. American Documentation, Vol. 20, pp. 119-142.
Lencek, R.L. (1966) The Verb Pattern of the Contemporary Standard Slovene. Wiesbaden: Otto Harrassowitz.
Lencek, R.L. (1982) The Structure and History of the Slovene Language. Columbia: Slavica.
Lennon, M. et al. (1981) An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, Vol. 3, pp. 177-183.
Lesk, M.E. and Salton, G. (1969) Relevance assessments and retrieval system evaluation. Information Storage and Retrieval, Vol. 4, pp. 343-359.
Lovins, J.B. (1968) Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, Vol. 11, pp. 22-31.
Lovins, J.B. (1971) Error evaluation for stemming algorithms as clustering algorithms. Journal of the American Society for Information Science, Vol. 22, pp. 28-40.
Lowe, T.C. et al. (1973) Additional Text Processing for On-line Retrieval (The RADCOL System). Technical Report RADC-TR-73-337.
Luhn, H.P. (1957) A statistical approach to mechanised encoding and searching of library information. IBM Journal of Research and Development, Vol. 1, pp. 309-317.
Luhn, H.P. (1958) The automatic creation of literature abstracts. IBM Journal of Research and Development, Vol. 2, pp. 159-165.
Lynch, M.F. (1977) Variety generation — a reinterpretation of Shannon's mathematical theory of communication and its implications for information science. Journal of the American Society for Information Science, Vol. 28, pp. 19-24.
Marcus, R.S. (1983) An experimental comparison of the effectiveness of computers and humans as search intermediaries. Journal of the American Society for Information Science, Vol. 34, pp. 381-404.
Markey, K. (1983) Online Catalogue Use: Results of Surveys and Focus Group Interviews in Several Libraries. Vol. II. OCLC Online Computer Library Center.
Martinovic, S. (1985) Automatizovani višejezični tezaurus (An automated multilingual thesaurus). Informatika, Vol. 13, pp.
21-35.
McCain, K.W. et al. (1987) Comparing retrieval performance in online data bases. Information Processing and Management, Vol. 23, pp. 539-553.
McCall, F.M. and Willett, P. (1986) Criteria for the selection of search strategies in best-match document retrieval systems. International Journal of Man-Machine Studies, Vol. 25, pp. 317-326.
Mohan, K.C. (1987) Choice of Retrieval Techniques for a Multi-Strategy Retrieval System. Ph.D. thesis, University of Sheffield.
Murtagh, F. (1983) A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, Vol. 26, pp. 354-359.
Nelis, K. (1985) Human Interaction with Computers in an Information Retrieval Context: a Study of the Users' Interaction with INSTRUCT. M.Sc. dissertation, University of Sheffield.
Niedermair, G.T. et al. (1985) MARS: a retrieval tool on the basis of morphological analysis. In: C.J. van Rijsbergen (ed.) Research and Development in Information Retrieval. Cambridge: CUP, pp. 369-380.
Noreault, T. and Chatham, R. (1982) A procedure for the estimation of term similarity coefficients. Information Technology: Research and Development, Vol. 1, pp. 189-196.
Noreault, T. et al. (1977) Automatic ranked output from Boolean searches in SIRE. Journal of the American Society for Information Science, Vol. 28, pp. 333-339.
Overhage, C.F.J. and Reintjes, J.F. (1974) Project INTREX: a general review. Information Storage and Retrieval, Vol. 10, pp. 157-188.
Pape, D.L. and Jones, R.L. (1988) STATUS with IQ — escaping from the Boolean straitjacket. Program, Vol. 22, pp. 32-43.
PARALOG (1990) A Guide for TRIP Managers, Version 2.4. Stockholm: PARALOG.
Paternost, J. (1963) The Slovenian Verbal System: Morphophonemics and Variations. Ph.D. thesis, Indiana University.
Perry, S.A. and Willett, P. (1983) A review of the use of inverted files for best-match searching in information retrieval systems. Journal of Information Science, Vol. 6, pp. 59-66.
Pogue, C.
and Willett, P. (1984) An evaluation of document retrieval from serial files using the ICL Distributed Array Processor. Online Review, Vol. 8, pp. 569-584.
Pollitt, A.S. (1986) An expert system approach to document retrieval: a summary of the CANSEARCH Research Project. Technical Report Series (86/6): Huddersfield Polytechnic.
Pollock, J.J. and Zamora, A. (1984) Automatic spelling correction in scientific and scholarly text. Communications of the ACM, Vol. 27, pp. 358-368.
Popovič, M. and Willett, P. (1990) Processing of documents and queries in a Slovene language free text retrieval system. Literary and Linguistic Computing, Vol. 5, pp. 182-190.
Porter, M.F. (1980) An algorithm for suffix stripping. Program, Vol. 14, pp. 130-137.
Porter, M.F. (1982) Implementing a probabilistic retrieval system. Information Technology: Research and Development, Vol. 1, pp. 131-156.
Porter, M.F. (1983) Information retrieval at the Sedgwick Museum. Information Technology: Research and Development, Vol. 2, pp. 169-186.
Porter, M.F. and Galpin, V. (1988) Relevance feedback in a public access catalogue for a research library: Muscat at the Scott Polar Research Institute. Program, Vol. 22, pp. 1-20.
Research Community of Slovenia (1989) Sistem znanstvenega in tehničnega informiranja v Sloveniji (The system of scientific and technical information in Slovenia). Ljubljana: Raziskovalna skupnost Slovenije.
Robertson, S.E. (1981) The methodology of information retrieval experiment. In: Sparck Jones, K. (ed.) Information Retrieval Experiment. London: Butterworths, pp. 9-31.
Robertson, S.E. (1986) On relevance weight estimation and query expansion. Journal of Documentation, Vol. 42, pp. 182-188.
Robertson, S.E. (1990) On sample sizes for non-matched-pair IR experiments. Information Processing and Management, Vol. 26, pp. 739-753.
Robertson, S.E. and Sparck Jones, K. (1976) Relevance weighting of search terms. Journal of the American Society for Information Science, Vol. 27, pp. 129-146.
Rolland-Thomas, P. and Mercure, G.
(1989) Subject access in a bilingual online catalogue. Cataloguing and Classification Quarterly, Vol. 10, pp. 141-150.
Sager, J.C. et al. (1982) Thesaurus integration in the social sciences. Part III: guidelines for the integration of thesauri. International Classification, Vol. 9, pp. 64-70.
Salton, G. (1969) Automatic processing of foreign language documents. In: G. Salton (ed.) Information Storage and Retrieval. Report ISR-16 to the National Science Foundation, Department of Computer Science, Cornell University, Ithaca, New York, pp. IV/1-IV/30.
Salton, G. (1971) The SMART Retrieval System — Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice-Hall.
Salton, G. (1975) Dynamic Information and Library Processing. Englewood Cliffs: Prentice-Hall.
Salton, G. (1986) Recent trends in automatic information retrieval. Proceedings of the Ninth International Conference on Research and Development in Information Retrieval. Washington: ACM, pp. 1-10.
Salton, G. (1988) Thoughts about modern retrieval technologies. Information Services and Use, Vol. 8, pp. 107-113.
Salton, G. and McGill, M.J. (1983) Introduction to Modern Information Retrieval. New York: McGraw-Hill.
Salton, G. et al. (1975) A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, Vol. 26, pp. 33-44.
Salton, G. et al. (1983) Extended Boolean information retrieval. Communications of the ACM, Vol. 26, pp. 1022-1036.
Siegel, S. and Castellan, N.J. (1988) Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.
Smeaton, A.F. (1990) Natural language processing and information retrieval. Information Processing and Management, Vol. 26, pp. 19-20.
Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, Vol. 28, pp. 11-21.
Sparck Jones, K. (ed.) (1981) Information Retrieval Experiment. London: Butterworths.
Sparck Jones, K.
and Bates, R.G. (1977) Report on a Design Study for the "Ideal" Information Retrieval Test Collection. British Library R&DD Report No. 5428.
Sparck Jones, K. and Tait, J.I. (1984) Automatic search term variant generation. Journal of Documentation, Vol. 40, pp. 50-66.
Stibic, V. (1980) Influence of unlimited ranking on practical online search strategy. Online Review, Vol. 4, pp. 273-278.
Tague, J.M. (1981) The pragmatics of information retrieval experimentation. In: Sparck Jones, K. (ed.) Information Retrieval Experiment. London: Butterworths, pp. 59-102.
Tancig, P. (1985) Računalniško razumevanje slovenskega jezika (Computer understanding of the Slovene language). Ph.D. thesis, University of Ljubljana.
Tarry, B.D. (1978) Automatic Suffix Generation and Word Segmentation for Information Retrieval. M.Sc. thesis, University of Sheffield.
Tenopir, C. (1984) Full text databases. Annual Review of Information Science and Technology, Vol. 19. New York: Elsevier Science Publishers, pp. 215-246.
Tollefson, J.W. (1981) The Language Situation and Language Policy in Slovenia. Washington: University Press of America.
Toporišič, J. (1975) Main characteristics of the Slovene language. In: Komac, D. and Škerlj, R. English-Slovene and Slovene-English Dictionary. Ljubljana: Cankarjeva založba, pp. 417-435.
Toporišič, J. (1984) Slovenska slovnica (Slovene grammar). 2nd ed., Maribor: Obzorja.
Ulmschneider, J.E. and Doszkocs, T. (1983) A practical stemming algorithm for online search assistance. Online Review, Vol. 7, pp. 301-318.
van Rijsbergen, C.J. (1979) Information Retrieval. 2nd ed., London: Butterworths.
Vickery, A. et al. (1987) A reference and referral system using expert system techniques. Journal of Documentation, Vol. 43, pp. 1-23.
Vidovič-Muha, A. (1988) Slovensko skladenjsko besedotvorje ob primerih zloženk (Slovene syntactic word-formation, with examples of compounds). Ljubljana: Partizanska knjiga.
Wade, S.J. and Willett, P. (1988) INSTRUCT: a teaching package for experimental methods in information retrieval. Part 3. Browsing, clustering and query expansion. Program, Vol. 22, pp. 44-61.
Wade, S.J. et al. (1988) A comparison of knowledge-based and statistically-based techniques for reference retrieval. Online Review, Vol. 12, pp. 91-108.
Wade, S.J. et al. (1989) SIBRIS: the Sandwich Interactive Browsing and Ranking Information System. Journal of Information Science, Vol. 15, pp. 249-260.
Walker, S. and Jones, R.M. (1987) Improving Subject Retrieval in Online Catalogues: 1. Stemming, Automatic Spelling Correction and Cross-Reference Tables. London: British Library (British Library Research Paper 24).
Wenzel, F. (1980) Semantische Eingrenzung im Freitext-Retrieval auf der Basis morphologischer Segmentierungen (Semantic delimitation in free-text retrieval on the basis of morphological segmentation). Nachrichten für Dokumentation, Vol. 31, pp. 29-35.
Willett, P. (1981) A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing and Management, Vol. 17, pp. 53-60.
Willett, P. (1985) Use of ranking methods in searches of textual and structural data bases. Proceedings of the Ninth International Online Information Meeting. Oxford: Learned Information, pp. 343-353.
Willett, P. (ed.) (1988a) Document Retrieval Systems. London: Taylor Graham.
Willett, P. (1988b) Recent trends in hierarchic document clustering: a critical review. Information Processing and Management, Vol. 24, pp. 577-597.
Willett, P. and Wood, F.E. (1989) Use of the INSTRUCT text retrieval program at the Department of Information Studies, University of Sheffield. Education for Information, Vol. 7, pp. 133-141.
Williams, M.E. (1985) Electronic databases. Science, Vol. 228, pp. 445-456.
Wolpert, S.A. (1983) A command language for the executive. Information Services and Use, Vol. 3, pp. 261-272.
Wood, F.E. (1981) Online teaching aids from the Department of Information Studies, University of Sheffield. Online Review, Vol. 5, pp. 487-494.
Wood, F.E. (1984) Teaching online information retrieval in United Kingdom library schools. Journal of the American Society for Information Science, Vol. 35, pp.
53-55.
Zipf, G.K. (1965) Human Behavior and the Principle of Least Effort. 2nd ed., New York: Hafner Publishing Company.