Compiling and Using the IJS-ELAN Parallel Corpus

Tomaž Erjavec
Department of Intelligent Systems, Jožef Stefan Institute
Jamova 39, SI-1000 Ljubljana, Slovenia
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

Keywords: natural language processing, corpus annotation, multilinguality, lexicon extraction

Received: July 10, 2002

With increasing amounts of text being available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. Explained is the corpus annotation chain involving normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.

1 Introduction

With more and more text being available in electronic form, it is becoming easy to obtain large quantities of digital texts and to process them computationally. If a collection of such texts is chosen according to specific criteria and is consistently and correctly marked-up [19], it is said to be a text corpus. Such corpora can be used for a variety of different purposes [17], from empirically grounded linguistic studies, lexicography and language teaching, to providing datasets for language technology programs for terminology extraction, word-sense disambiguation, etc.

Collected and uniformly encoded collections of texts are already quite useful, but it is the addition of (linguistic) markup that makes corpora a prime resource for language exploration. As will be seen, we view the process of compiling a corpus as one of annotation accrual: starting from a plain text, we successively add markup, thereby enriching the information contained in the corpus. This markup is typically automatically produced, but can be hand validated and, in a cyclic process, can serve for inductive programs to learn better models of the language with which to annotate subsequent generations of corpora. The added annotation enables the people and software using the corpus to employ extra levels of abstraction, leading to better exploitation results.

If monolingual corpora are already useful for a variety of purposes, it is multilingual corpora that open the way for empirical study of the translation process. Especially valuable are so-called parallel corpora, i.e., corpora consisting of texts together with their translations into one or more languages. They can be used directly as translation aids for humans or can provide data for the automatic induction of translation resources (lexica) and software (machine translation).

In this paper we explore the process of compilation and exploitation of such parallel corpora, grounding the discussion in our experience with two annotated parallel corpora: the 7-language MULTEXT-East corpus [6, 9], which contains the novel "1984" by G.
Orwell (100,000 words per language) and has had its annotation manually validated; and the larger (500,000 words per language) automatically annotated Slovene-English IJS-ELAN corpus [8, 10].

Our work is characterised by the use of open standards for text annotation and the use of publicly available third-party tools. We claim that it is better to invest labour into producing high-quality annotated corpora than in trying to build tools for such annotation from scratch. Unlike local and idiosyncratic software, linguistic resources encoded in a standard manner will sooner be useful to other research groups. Such largesse aside, there also exist more and more (statistical or symbolic) machine learning programs that are able to induce language models from pre-annotated corpora. They are typically more robust and, with large enough training sets, might even perform better than hand-crafted systems.

The rest of this paper is structured as follows: Section 2 introduces standards for corpus annotation, which are then used in the examples in the remainder of the paper; Section 3 enumerates the basic (pre-linguistic) processing steps involved in the compilation of a corpus; Section 4 details the more complex word-level syntactic annotation, which is performed by a trainable algorithm; Section 5 turns to the exploitation of corpora and gives two examples: an on-line concordancer, and an experiment in bilingual lexicon extraction; Section 6 gives conclusions and directions for further research.

2 Encoding Standards

While the question of the encoding format for corpora and other language resources might seem incidental to the main task of producing and exploiting the corpora, it has long been known that the proliferation of data formats and annotation schemes, many of which are proprietary and poorly documented, is a significant bottleneck for resource sharing, re-use and longevity. There have therefore been a number of attempts to standardise the encoding of various language resources, among them corpora.

All such standardisation efforts have as their basis the ISO Standard Generalized Markup Language, SGML, or, more recently, the W3C Extensible Markup Language, XML [26], a simplified form of SGML meant primarily for the interchange of data on the Web. SGML and XML are metalanguages, that is, a means of formally describing a language, in this case, a markup language. They thus do not directly define a particular set of tags but rather provide the mechanisms for defining such sets for particular purposes, e.g., for the encoding of corpora.

The best known and most widely used set of conventions for encoding a wide variety of texts, among them corpora, are the SGML-based Text Encoding Initiative Guidelines (TEI), the most recent version of which is also XML compliant [20]. The TEI consist of a formal part, which is a set of SGML/XML Document Type Definition fragments, and the documentation, which explains the rationale behind the elements available in these fragments, as well as giving overall information about the structure of the TEI. The DTD fragments are combined to suit an individual project and, if necessary, also extended or modified. We have used parametrisations of the TEI for both the MULTEXT-East corpus and the IJS-ELAN corpus. TEI encoded examples from these corpora will be used in the rest of this paper.

3 Pre-processing the Corpus

In this section we deal with the basic processing steps involved in normalising and marking-up the corpus, in order to make it minimally useful
for exploitation and to prepare it for further annotation. The steps we outline below usually proceed in sequence and can, for the most part, be performed automatically, although — given the unconstrained nature of texts — they are likely to be less than 100% accurate. While further development of tools and associated resources lowers the error rate, manual validation might still be necessary for high-quality corpora.

The texts constituting a corpus can be collected from a variety of sources and so usually come in a range of formats. The first step in corpus preparation is therefore invariably the normalisation of the texts into a common format. Usually custom filters — written in pattern matching languages such as Perl — are employed to, on the one hand, normalise the character sets of the documents and, on the other, to remove and convert the formatting of the originals.

3.1 Character sets

As far as character sets go, the corpus compilers have a few options at their disposal. One possibility is to use — at least for European languages — an 8-bit encoding in the corpus, preferably a standard one, e.g., ISO 8859 Latin 2 for encoding Slovene texts. While the advantage is that the corpus texts are immediately readable — given that we have installed the appropriate fonts — the disadvantage is that a number of processing applications do not handle 8-bit characters well and, more importantly, that it is impossible to mix languages that use different character sets; this is, of course, a special concern in multilingual corpora.

Until a few years ago the standard solution involved translating the non-ASCII characters of the original texts into ISO-mandated SGML entities. SGML (and XML) entities are descriptive names, somewhat like macros, for which the application then substitutes their values. In our case, the names are those of the characters in question; so, for example, the Slovene letter č is written as the entity &ccaron; (small c with caron), where the ampersand starts an entity and the semicolon ends it. Such entities for characters are defined in public entity sets, in the case of &ccaron; in ISO 8879:1986//ENTITIES Added Latin 2//EN, i.e., the added entity set for encoding the Latin alphabets of Eastern European languages. Similar entity sets exist for Western European languages (Latin 1), for Greek, Russian and non-Russian Cyrillic, as well as for mathematical relations, publishing symbols, etc. The advantage of using entities is their robustness: because they are encoded in (7-bit) ASCII characters they are portable and can be read on any platform. For the application, the SGML processor then translates them into the desired encoding via their definitions.

With the advent of XML, this solution has somewhat lost its currency. While entities are supported in XML, the default character set of XML is Unicode, which, because a character can be encoded in two bytes, is sufficient to accommodate most of the characters of the world's scripts; so, this is in fact the only solution for Eastern scripts, such as Kanji. But while using Unicode makes for a theoretically good solution, practice still lags behind, with many applications not yet being Unicode-aware. In our corpora we have therefore chosen to use SGML entities for the representation of non-ASCII characters.
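As a minimal sketch of the entity mechanism (the document structure and the sample sentence are invented for illustration), an XML document can declare such an entity in its internal DTD subset and then use it in the text:

  <?xml version="1.0"?>
  <!DOCTYPE text [
    <!-- map the entity name to the Unicode code point for "c with caron" -->
    <!ENTITY ccaron "&#x010D;">
  ]>
  <text>
    <!-- the parser replaces &ccaron; with the character itself -->
    <p>Slovenske &ccaron;rke je treba zapisati prenosljivo.</p>
  </text>

In practice the corpora reference the public ISO entity sets mentioned above from the DTD, rather than declaring each entity inline as done here.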
3.2 Markup of gross document structure

Various encodings of input texts also encode the structure of the documents in vastly different and often inconsistent manners. This structure includes such things as divisions of the text, headers and titles, paragraphs, tables, footnotes, emphasised text, etc. In general it is a very hard task to correctly and completely transform this structure into descriptive TEI markup. For many natural language processing applications this might not even be necessary, as the task here is not to preserve the layout of the text, but only the information that is relevant to correctly classify the text and to enable linguistic processing. To illustrate a case of relatively detailed structure markup, we give, in Figure 1, a TEI encoded example from the MULTEXT-East corpus.
First part
I

It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.
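In the corpus this passage is wrapped in nested TEI elements for text divisions, headings, paragraphs and sentences. The sketch below gives the general shape of such markup; the element names are standard TEI, but the type values and identifiers are assumed for illustration rather than copied from the figure:

  <div type="part" id="Opt.1">
    <head>First part</head>
    <div type="chapter" id="Opt.1.1">
      <head>I</head>
      <p id="Opt.1.1.2">
        <s id="Opt.1.1.2.1">It was a bright cold day in April, and the
          clocks were striking thirteen.</s>
        <s id="Opt.1.1.2.2">Winston Smith, his chin nuzzled into his
          breast in an effort to escape the vile wind, ...</s>
      </p>
    </div>
  </div>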

Figure 1: Structure markup in TEI

3.3 Header Information

A corpus is typically composed of a large number of individual texts. For analysing the corpus or for choosing a particular subset out of it, it is vital to include information about the texts in the corpus. Of course, the corpus as a unit must also be documented. The TEI provides a header element, expressly meant to capture such meta-data. The TEI header contains detailed information about the file itself, the source of its text, its encoding, and its revision history.

The information in a text or corpus header can be — depending on the number of texts and the regularity of provenance — either inserted manually or automatically. Given that the corpora discussed in this paper have relatively few components, their headers have been entered manually via an SGML/XML aware text editor.

3.4 Tokenisation and segmentation

The next step in corpus preparation already involves basic linguistic analysis, namely isolating the linguistic units of the text, i.e., words and sentences. The identification of words — and punctuation marks — is usually referred to as tokenisation, while determining sentence boundaries goes by the name of segmentation.

On the face of it, determining what is a word and what is a sentence might seem trivial. But correctly performing these tasks is fraught with complexities [12] and the rules to perform them are, furthermore, language dependent. So, while sentences end with full stops, not every full stop ends a sentence, as with, e.g., Mr.; and if some abbreviations will never end sentences, e.g., e.g., others almost invariably will, e.g., etc. Correct tokenisation is complex as well; punctuation marks can sometimes be part of a word, as is the case with abbreviations and, say, Web addresses. Some domains, e.g., biomedicine, have "words" with an especially complex internal structure, e.g., Ca(2+)-ATPase.

In the process of tokenisation various types of words and punctuation symbols must be recognised, and this information can be retained in the markup, as it can be potentially useful for further processing. In Figure 2 we give an example of a segmented and tokenised text from the IJS-ELAN corpus, where the type attribute expresses such information on words.

Euromoney's assessment of economic changes in Slovenia has been downgraded (page 6).

Figure 2: Segmentation and tokenisation in TEI

While it is possible to write a tokeniser and segmenter using a general purpose computer language, there also exist freely available tools for this purpose. We have extensively used the MULTEXT tools [3], which, however, no longer seem to be maintained.

Fortunately, other good choices exist, e.g., the text tokeniser tool LT TTT [13], which is freely distributed for academic purposes as binaries for Sun/Solaris. LT TTT is based on XML and incorporates a general purpose cascaded transducer which processes an input stream deterministically and rewrites it according to a set of rules provided in a grammar file, typically to add mark-up information. With LT TTT come grammars to segment English texts into paragraphs, segment paragraphs into words, recognise numerical expressions, mark up money, date and time expressions in newspaper texts, and mark up bibliographical information in academic texts. These grammars are accompanied by detailed documentation which allows altering the grammars to suit particular needs or developing new rule sets.
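The output of such tools is word- and sentence-level markup of the kind shown in Figure 2. Concretely, the sentence there could be encoded along the following lines; the element names (s for sentence, w for word, c for punctuation character) follow common TEI practice, while the type values and the identifier are illustrative rather than copied from the corpus:

  <s id="ecmr.en.17">
    <w>Euromoney's</w>
    <w>assessment</w>
    <w>of</w>
    <w>economic</w>
    <w>changes</w>
    <w>in</w>
    <w>Slovenia</w>
    <w>has</w>
    <w>been</w>
    <w>downgraded</w>
    <c type="open">(</c>
    <w>page</w>
    <w type="dig">6</w>
    <c type="close">)</c>
    <c type="stop">.</c>
  </s>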
While we have already successfully used LT TTT for processing English texts, the localisation of its grammars to Slovene still remains to be done.

3.5 Sentence Alignment

An extremely useful processing step involving parallel corpora is the alignment of their sentences. Such alignment would be trivial if one sentence were always translated into exactly one sentence. But while this may hold for certain legal texts, translators — either due to personal preference, or because different languages tend towards different sentence lengths — often merge or split sentences, or even omit portions of the text. This makes sentence alignment a more challenging task.

High-quality sentence alignment is still an open research question and many methods have been proposed [23], involving the utilisation of, e.g., bilingual lexicons and document structure. Still, surprisingly good results can be achieved with a language-independent and knowledge-poor method, first discussed in [11]. The alignment algorithm here makes use of the simple assumption that longer sentences tend to be translated into longer sentences, and shorter into shorter. So, if we come across, e.g., two short sentences in the original but one long one in the translation, chances are the two have been merged. Hence the input to this aligner is only the lengths of the respective sentences in characters, and the program, with an algorithm known as dynamic time warping, finds the best fit for the alignments, assuming the valid possibilities are 1-2, 0-1, 1-1, 1-0, 2-1, and 2-2. Several public implementations of this algorithm exist; we have used the so-called Vanilla aligner [4], implemented in C and freely available in source code.

The quality of the automatic alignment is heavily dependent on the manner of translation but, in any case, is seldom perfect. For our corpora we have manually validated the alignments via a cyclic process, with initial errors of alignment corrected and the text then being automatically re-aligned.

The end result is the sentence aligned text; the alignment information might then be encoded in one of several ways. One possibility is to encode the alignments in separate documents, where only pairs of references to sentence IDs are stored. Figure 3 gives a hypothetical Slovene-English alignment span illustrating the syntax and types (one, many, zero) of the alignment links. The first link encodes a 1-1 alignment, the second a 2-1 and the third a 1-0 alignment.

Figure 3: Example of stand-off bilingual alignment
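A minimal sketch of such a stand-off encoding is given below; the element and attribute names, as well as the sentence identifiers, are chosen for illustration only and are not quoted from the corpus:

  <linkGrp type="align">
    <!-- 1-1: one Slovene sentence aligned with one English sentence -->
    <link xtargets="sl.1.1 ; en.1.1"/>
    <!-- 2-1: two Slovene sentences merged into one English sentence -->
    <link xtargets="sl.1.2 sl.1.3 ; en.1.2"/>
    <!-- 1-0: a Slovene sentence with no English counterpart -->
    <link xtargets="sl.1.4 ; "/>
  </linkGrp>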
4 Word-class syntactic tagging

It is well known that the addition of word-level syntactic tags adds significantly to the value of a corpus [22]. Knowing the part-of-speech and other morphosyntactic features, such as number, gender and case, helps to lexically determine the word and serves as a basis for further syntactic or semantic processing. In a parallel corpus such annotations can also act as guides for automatic bilingual lexicon extraction or example based machine translation.

The flip side of morphosyntactic tagging is lemmatisation, i.e., annotating the words in the corpus with their lemmas or base forms. Such a normalisation of the word-forms is useful in concordancing the corpus and in identifying translation equivalents. While lemmatisation for English is relatively simple (although wolves and oxen complicate matters), it is a more difficult task in Slovene, which is a heavily inflecting language.

Manual annotation is extremely expensive, so corpora are typically tagged and lemmatised automatically. Below we explain our work on the IJS-ELAN corpus, where we used the statistical tagger TnT, which had been trained on the MULTEXT-East parallel corpus, with the initial results then improved in various ways.

4.1 The TnT tagger

Trainable word-class syntactic taggers have reached the level of maturity where many models and implementations exist, with several being robust and available free of charge. Prior to committing ourselves to a particular implementation, we conducted an evaluation (on Slovene data) of a number of available taggers [7]. The results show that the trigram-based TnT tagger [1] is the best choice considering accuracy (also on unknown words) as well as efficiency. TnT is freely available under a research license, as an executable for various platforms.

The tagger first needs to be trained on an annotated corpus; the training stage produces a table with tag tri-, bi- and uni-grams and a lexicon with the word forms followed by their tag ambiguity classes, i.e., the list of possible tags, together with their frequencies. Using these two resources, and possibly a backup lexicon, tagging is then performed on unannotated data.

4.2 The training corpus

The greatest bottleneck in the induction of a quality tagging model for Slovene is the lack of training data. The only available hand-validated tagged corpus is the Slovene part of the MULTEXT-East corpus, which is annotated with validated, context-disambiguated morphosyntactic descriptions and lemmas. These morphosyntactic descriptions (MSDs) for Slovene — and six other languages, English among them — were developed in the MULTEXT-East project [6, 9]. The MSDs are structured and more detailed than is commonly the case for English part-of-speech tags; they are compact string representations of a simplified kind of feature structures. The first letter of an MSD encodes the part of speech, e.g., Noun or Adjective. The letters following the PoS give the values of the position-determined attributes. So, for example, the MSD Ncfpg expands to PoS:Noun, Type:common, Gender:feminine, Number:plural, Case:genitive. In case a certain attribute is not appropriate for the particular combination of features, or for the word in question, this is marked by a hyphen in the attribute's position.

To illustrate the properties of the training corpus, as well as the difference between Slovene and English, we give in Table 1 the number of word tokens in the corpus, the number of different word types in the corpus (i.e., of word-forms regardless of capitalisation or their annotation), the number of different context-disambiguated lemmas, and the number of different MSDs. The inflectional nature of Slovene is evident from the larger number of distinct word-forms and especially MSDs used in the corpus.

           English   Slovene
  Words    104,286    90,792
  Forms      9,181    16,401
  Lemmas     7,059     7,903
  MSDs         134     1,023

Table 1: Inflection in the MULTEXT-East corpus
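The correspondence between the compact MSD strings and feature structures can also be spelled out in markup. As a sketch, the MSD Ncfpg from above could be rendered with TEI-style feature-structure elements as follows; the corpus itself stores only the compact string, and the element names here are merely one possible rendering:

  <fs type="MSD">
    <f name="PoS"><sym value="Noun"/></f>
    <f name="Type"><sym value="common"/></f>
    <f name="Gender"><sym value="feminine"/></f>
    <f name="Number"><sym value="plural"/></f>
    <f name="Case"><sym value="genitive"/></f>
  </fs>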
4.3 Tagging Slovene

Unsurprisingly, the tagging produced by TnT trained only on the MULTEXT-East corpus had quite a low accuracy; this can be traced both to the inadequate induced lexicon, as more than a quarter of all word tokens in IJS-ELAN were unknown, as well as to the n-grams being applied to very different text types than those used for training. To offset these shortcomings we employed two methods, one primarily meant to augment the n-grams, and the other the lexicon.

It is well known that "seeding" the training set with a validated sample from the texts to be annotated can significantly improve results. We selected a sample comprising 1% of the corpus segments (approx. 5,000 words) evenly distributed across the whole of the corpus. The sample was then manually validated and corrected, also with the help of Perl scripts, which pointed out certain typical mistakes, e.g., the failure of case, number and gender agreement between adjectives and nouns. The tagger n-grams were then re-learned using the concatenation of the validated ELAN sample with the Slovene MULTEXT-East corpus.

It has also been shown [14] that a good lexicon is much more important for quality tagging of inflective languages than the higher-level models, e.g., bi- and tri-grams. A word that is included in a TnT lexicon gains the information on its ambiguity class, i.e., the set of context-independent possible tags, as well as the lexical probabilities of these tags. The Slovene part of the ELAN corpus was therefore first lexically annotated, courtesy of the company Amebis, d.o.o., which also produces the spelling checker for Slovene Word. The large lexicon used covers most of the words in the corpus; only 3% of the tokens remain unknown. This lexical annotation includes not only the MSDs but also, paired with the MSDs, the possible lemmas of the word-form.

We first tried using a lexicon derived from these annotations directly as a backup lexicon with TnT. While the results were significantly better than with the first attempt, a number of obvious errors remained and additional new errors were at times introduced. The reason turned out to be that the tagger is often forced to fall back on uni-gram probabilities, but the backup lexicon contains only the ambiguity class, with the probabilities of the competing tags being evenly distributed. So, TnT in effect often assigned a random tag from the ones available, leading to poor results. To remedy the situation, a heuristic was used to estimate the lexical frequencies of unseen words, taking as the basis the known frequencies of similar ambiguity classes in the training corpus.

Using this lexicon and the seeded model we then re-tagged the Slovene part of the IJS-ELAN corpus. Manually validating a small sample of the tagged corpus, consisting of around 5,000 words, showed that the current tagging accuracy is about 93%.

As has been mentioned, the lexical annotations included lemmas along with the MSDs. Once the MSD disambiguation had been performed it was therefore trivial to annotate the words with their lemmas. But while all the words, known as well as unknown, have been annotated with an MSD, we have so far not attempted to lemmatise the approximately 3% of the corpus words which are unknown.

The results of the tagging were encoded in the corpus as attribute values of the TEI word elements. To illustrate, we give in Figure 4 an example sentence from the corpus: Razlike med metropolitanskimi centri in njihovim zaledjem so ogromne. / The differences between the metropolitan centres and their hinterlands are enormous.

Razlike med metropolitanskimi centri in njihovim zaledjem so ogromne.

Figure 4: Linguistic annotation in the corpus
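The shape of this annotation can be sketched as follows; the attribute names follow common TEI usage, and the identifier, MSD and lemma values shown here are illustrative rather than quoted from the corpus:

  <s id="ecmr.sl.17">
    <w ana="Ncfpn" lemma="razlika">Razlike</w>
    <w ana="Spsi" lemma="med">med</w>
    <!-- ... the remaining words are annotated in the same way ... -->
    <w ana="Afpfpn" lemma="ogromen">ogromne</w>
    <c>.</c>
  </s>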
4.4 Tagging English

Tagging the English part of the corpus with the MULTEXT-East MSDs was also performed with TnT, using the English part of the MULTEXT-East corpus as the training set. However, automatic tagging with this model is bound to contain many errors, although fewer than for Slovene, given the much smaller tagset.

Rather than try to improve the accuracy of our own tagging, we opted for additional annotations with other, better models and also with a better known tagset, namely (variants of) the one used in the Brown corpus [15]. For the additional annotation of the English part we combined the output of two taggers.

First, the TnT tagger distribution already includes some English models. We chose the one produced by training on the concatenation of the Brown corpus with the Wall Street Journal corpus; this training set contained approximately 2.5 million tokens and distinguishes 38 different word-tags.

Second, we used QTag [16], which is also a freely available probabilistic tri-gram tagger, although the underlying algorithm differs from that employed by TnT. The English model of QTag uses a similar, although not identical, tagset to the TnT English one. QTag is also offered via an email service, which in addition to tagging the texts also lemmatises them; we used this lemmatisation to annotate the corpus. To illustrate the tagging of the English part, we give an example in Figure 5.

There are huge differences between the centres and their suburban areas.

Figure 5: Linguistic annotation in the English part

5 Utilising the Corpus

The IJS-ELAN corpus was thus encoded in XML/TEI, segmented, tokenised, aligned and tagged with morphosyntactic descriptions and lemmas. It was now time to turn to exploiting the corpus. Given that the texts included in the corpus do not have copyright restrictions (they are mostly publications of the Slovene government), it was trivial to ensure one type of "exploitation", namely to simply make the complete corpus freely available for downloading: it can be accessed at http://nl.ijs.si/elan/.

In this section we discuss two methods of utilising the corpus. The first is geared directly towards human usage, and has to do with making the corpus available for sophisticated on-line searching. The second employs a statistics-based tool that extracts a bilingual lexicon from the corpus.

5.1 Web concordancing

For our corpora we have developed an on-line concordancing system. This Web concordancer comprises a set of HTML pages, a simple Perl CGI script and a corpus processing back-end. The back-end is the CQP system [2], a fast and robust program, which is freely available for research purposes as binaries for a number of platforms. CQP supports parallel corpora and incorporates a powerful query language that offers extended regular expressions over positional (e.g., word, lemma, MSD) and structural (e.g., sentence and paragraph) attributes.

The Web page of the concordancer contains various input fields and settings available to the user. The settings and options have associated hyperlinks, and clicking on them gives help on the particular topic. So, for example, the Display setting affects how the search results are presented: the Bilingual Display shows the hits in the target corpus, followed by their aligned segment in the translation; the KWIC Display shows the results in the familiar key-word in context format; and the Word List Display gives a list of word types found in the corpus, together with their frequencies. The last option makes the most sense with fuzzy queries. The result of the query can also be refined by specifying an additional query on the aligned corpus. This constraint can be either required or forbidden. The latter option is useful when exploring 'unexpected' translations.

The on-line concordancer has been used at the Department of Translation and Interpreting at the University of Ljubljana for different purposes such as contrastive analysis, translation evaluation, translation-oriented lexical and terminology studies, discourse analysis, etc. The methodological aims of this work were, on the one hand, to help students gain a deeper understanding of living language and remember things they discover themselves, and, on the other, to enable them to become skilled and critical users of corpora for translation purposes.

The concordancer is also being used by translators, esp. by the volunteers of LUGOS, the Linux Users' Group of Slovenia, who are localising Linux documentation, e.g., the HOWTOs and the KDE desktop environment. As the IJS-ELAN corpus contains a whole book on Linux and the PO localisation files, it can be a welcome source of terminology translations.

5.2 Lexicon extraction

We have also performed an initial experiment in automatic bilingual lexicon extraction from the corpus. Extracting such lexica is one of the prime uses of parallel corpora, as manual construction is an extremely time-consuming process, yet the resource is invaluable for lexicographers, terminologists and translators, as well as for machine translation systems.

A number of similar experiments had already been performed on the IJS-ELAN corpus [24, 5, 25], using a number of different tools. The software we have used here is the PWA system [21], which is a collection of tools for automatically finding translation equivalents in sentence aligned parallel corpora. The output of the system is, inter alia, a list of word token correspondences (i.e., translations in the text), a list of word type correspondences (i.e., a lexicon) and lists of monolingual collocations (i.e., a terminological glossary). The system is freely available under a research license as a binary for various platforms.

For the data, we have used one of the elements of the IJS-ELAN corpus, namely the book "Linux Installation and Getting Started" by Matt Welsh et al., translated into Slovene by Roman Maurer. The book contains 2 x 5,773 aligned sentence segments, with the English original having 91,526 and the Slovene translation 81,955 word tokens.

For lexicon extraction we have not used the word-forms directly but rather the lemmas (where defined) of the words in question. This normalises the input and abstracts away from the rich inflections of Slovene, which would cause PWA to treat different forms of the same word as different words. Secondly, we reduced the input to the system to only adjectives, nouns and punctuation symbols.
The reasoning behind this is that the most useful (terminological) lexical correspondences will be noun phrases, and eliminating the other word classes reduces the chance of spurious translation correspondences. We have included punctuation signs in order to break up long stretches of nouns, which otherwise tend to get analysed as collocations.

For this input data the PWA system took 15 minutes (on a Pentium laptop) to produce the results, i.e., the list of token correspondences totalling 22,880 items, the lexicon containing 2,850 entries, and a list of collocations with 1,329 entries. To illustrate, we show in Figure 6 data for one sentence from the text; first we give the sentence and its translation, then the equivalent input to the system, and finally the computed translation equivalents.

Most posited translations are correct, but some are less than perfect. While the system correctly identifies translation equivalents for linux and system, it misses out on the larger collocation linux system, and similarly for user program and development tool. The main shortcoming of the output for the example sentence is the suggested translation equivalent for source code, as it lacks the noun koda. But despite these omissions, the result is already quite useful.

English sentence: In addition, all of the source code for the Linux system, including the kernel, device drivers, libraries, user programs, and development tools, is freely distributable.

Slovene sentence: Dodatno je dostopna in prosto razširljiva še vsa izvorna koda sistema Linux, vključno z jedrom, gonilniki naprav, knjižnicami, uporabniškimi programi in razvojnimi orodji.

English input: addition , source code linux system , kernel , device driver , library , user program , development tool , distributable .

Slovene input: dostopen razširljiva izvoren koda sistem Linux , jedro , gonilnik naprava , knjižnica , uporabniški program razvojen orodje .

Output English -> Slovene translations:
source code -> izvoren
linux -> linux
system -> sistem
kernel -> jedro
device driver -> gonilnik
library -> knjižnica
user -> uporabniški
program -> program
development -> razvojen
tool -> orodje

Figure 6: Automatically extracted translation equivalents

6 Conclusions and Further Research

The paper presented the processing steps involved in building and exploiting parallel corpora and introduced the tools necessary to accomplish this task. We have tried to show how third-party publicly available software is sufficient to operationalise the complete tool chain, and how language resources can be built in a cyclic manner, with initial annotations enabling the production of language models which, in turn, enable refinement and further annotation.

The text processing model outlined in this article is especially useful in an academic environment: the software to implement it is free, and can thus be easily acquired by cash-strapped university departments. The building as well as the exploitation of the resources can be profitably used for teaching purposes, be it for computer science courses on natural language processing, or for linguistic courses on the use of language technology. As has been shown, the resources can also be used directly, by helping translators or language students make use of bilingual data.

As far as corpus compilation goes, our further work can be divided into two areas. The first is, of course, the acquisition of more texts, and hence the production of larger corpora.
The second, and scientifically more challenging, is the addition of further markup. In the paper we have discussed only basic linguistic markup. Useful further annotations include terms (either single- or multi-word), named entities (proper names, acronyms, dates, etc.), chunks (esp. noun phrases), and phrase structure (i.e., full syntactic analysis). Each of these areas has been the subject of much research but has, so far, not been attempted for Slovene.

On the exploitation side, in addition to carrying further our research on lexicon extraction, we plan to experiment with statistical machine translation, in particular to use the freely available system EGYPT [18] with the IJS-ELAN corpus as the training set.

Acknowledgements

Thanks go to Amebis, d.o.o., for lexically annotating version 2 of the Slovene part of the IJS-ELAN corpus, and to Jin-Dong Kim and two anonymous reviewers for reading the draft of this paper.

References

[1] Thorsten Brants. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pages 224-231, Seattle, WA, 2000. http://www.coli.uni-sb.de/~thorsten/tnt/.

[2] Oliver Christ. A Modular and Flexible Architecture for an Integrated Corpus Query System. In Proceedings of COMPLEX '94: 3rd Conference on Computational Lexicography and Text Research, pages 23-32, Budapest, Hungary, 1994. http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.

[3] Philippe Di Cristo. MtSeg: The Multext multilingual segmenter tools. MULTEXT Deliverable MSG 1, Version 1.3.1, CNRS, Aix-en-Provence, 1996. http://www.lpl.univ-aix.fr/projects/multext/MtSeg/.

[4] Pernilla Danielsson and Daniel Ridings. Practical presentation of a "vanilla" aligner. In TELRI Newsletter No. 5. Institut für Deutsche Sprache, Mannheim, 1997. http://nl.ijs.si/telri/Vanilla/doc/ljubljana/.

[5] Gael Dias, Špela Vintar, Sylvie Guillore, and Jose Gabriel Pereira Lopes. Normalising the IJS-ELAN Slovene-English Parallel Corpus for the Extraction of Multilingual Terminology. In Computational Linguistics in the Netherlands 1999, Selected Papers from the Tenth CLIN Meeting, pages 29-40, Utrecht, 1999. UILOTS.

[6] Ludmila Dimitrova, Tomaž Erjavec, Nancy Ide, Heiki-Jaan Kaalep, Vladimir Petkevič, and Dan Tufiş. Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. In COLING-ACL '98, pages 315-319, Montreal, Quebec, Canada, 1998. http://nl.ijs.si/ME/.

[7] Sašo Džeroski, Tomaž Erjavec, and Jakub Zavrel. Morphosyntactic Tagging of Slovene: Evaluating PoS Taggers and Tagsets. In Second International Conference on Language Resources and Evaluation, LREC'00, pages 1099-1104, Paris, 2000. ELRA.

[8] Tomaž Erjavec. The ELAN Slovene-English Aligned Corpus. In Proceedings of the Machine Translation Summit VII, pages 349-357, Singapore, 1999. http://nl.ijs.si/elan/.

[9] Tomaž Erjavec. Harmonised Morphosyntactic Tagging for Seven Languages and Orwell's 1984. In 6th Natural Language Processing Pacific Rim Symposium, NLPRS'01, pages 487-492, Tokyo, 2001. http://nl.ijs.si/ME/V2/.

[10] Tomaž Erjavec. The IJS-ELAN Slovene-English Parallel Corpus. International Journal of Corpus Linguistics, 7(1):1-20, 2002. http://nl.ijs.si/elan/.

[11] William Gale and Ken W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75-102, 1993.

[12] Greg Grefenstette and Pasi Tapanainen. What is a word, what is a sentence? Problems of tokenization.
In Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX'94), pages 79-87. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, 1994.

[13] Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. LT TTT - A Flexible Tokenisation Tool. In Second International Conference on Language Resources and Evaluation, LREC'00, 2000. http://www.ltg.ed.ac.uk/software/ttt/.

[14] Jan Hajič. Morphological Tagging: Data vs. Dictionaries. In ANLP/NAACL 2000, pages 94-101, Seattle, 2000.

[15] Henry Kučera and William Nelson Francis. Computational Analysis of Present Day American English. Brown University Press, Providence, Rhode Island, 1967.

[16] Oliver Mason. QTag — a portable probabilistic tagger, 1998. http://www-clg.bham.ac.uk/QTAG/.

[17] Tony McEnery and Andrew Wilson. Corpus Linguistics. Edinburgh University Press, 1996.

[18] Franz Josef Och and Hermann Ney. Statistical Machine Translation. In Proceedings of the European Association for Machine Translation Workshop, EAMT'00, pages 39-46, Ljubljana, Slovenia, May 2001. http://nl.ijs.si/eamt00/.

[19] John Sinclair. Corpus typology. EAGLES Document EAG-CSG/IR-T1.1, Commission of the European Communities, 1994.

[20] C. M. Sperberg-McQueen and Lou Burnard, editors. Guidelines for Electronic Text Encoding and Interchange, The XML Version of the TEI Guidelines. The TEI Consortium, 2002. http://www.tei-c.org/.

[21] Jörg Tiedemann. Extraction of translation equivalents from parallel corpora. In Proceedings of the 11th Nordic Conference on Computational Linguistics. Center for Sprogteknologi, Copenhagen, 1998. http://numerus.ling.uu.se/~corpora/plug/pwa/.

[22] Hans van Halteren, editor. Syntactic Wordclass Tagging. Kluwer Academic Publishers, 1999.

[23] Jean Véronis, editor. Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic Publishers, 2000.

[24] Špela Vintar. A Parallel Corpus as a Translation Aid: Exploring EU Terminology in the ELAN Slovene-English Parallel Corpus. In Proceedings of the 34th Colloquium of Linguistics, Germersheim, Frankfurt, 1999. Peter Lang Verlag.

[25] Špela Vintar. Using Parallel Corpora for Translation-Oriented Term Extraction. Babel, 47(2):121-132, 2001.

[26] W3C. Extensible Markup Language (XML) Version 1.0, 1998. http://www.w3.org/TR/1998/REC-xml-19980210.