Informatica 30 (2006) 447-452

SI-PRON Pronunciation Lexicon: a New Language Resource for Slovenian

Jerneja Žganec Gros
Alpineon Research and Development, Ulica Iga Grudna 15, SI-1000 Ljubljana, Slovenia
E-mail: jerneja.gros@alpineon.si, http://www.alpineon.si

Varja Cvetko-Orešnik and Primož Jakopin
Fran Ramovš Institute of the Slovenian Language, Novi trg 4, SI-1000 Ljubljana, Slovenia
E-mail: isj@zrc-sazu.si, http://isjfr.zrc-sazu.si/

Keywords: language resources, pronunciation lexicon, PLS

Received: August 12, 2006

We present the efforts involved in designing SI-PRON, a comprehensive machine-readable pronunciation lexicon for Slovenian. It has been built from two sources and contains all the lemmas from the Dictionary of Standard Slovenian (SSKJ), the most frequent inflected word forms found in contemporary Slovenian texts, and a first pass of inflected word forms derived from SSKJ lemmas. The lexicon file contains the orthography, corresponding pronunciations, lemmas and morphosyntactic descriptors of lexical entries in a format based on requirements defined by the W3C Voice Browser Activity. The current version of the SI-PRON pronunciation lexicon contains over 1.4 million lexical entries. The word list determination procedure, the generation and validation of phonetic transcriptions, and the lexicon format are described in the paper. Along with Onomastica, SI-PRON presents a valuable language resource for linguistic studies and research of speech technologies for Slovenian. The lexicon is already being used by the Proteus Slovenian text-to-speech synthesis system and for generating audio samples of the SSKJ headwords.

Povzetek (Slovenian summary): The paper describes a new language resource for Slovenian, the SI-PRON pronunciation lexicon.

1 Introduction

Consistent specification of word pronunciation is critical to the success of many speech technology applications.
Most state-of-the-art Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) systems rely on lexicons, which contain pronunciation information for many words. To provide maximum coverage of the words, multi-word expressions or even phrases that commonly occur in a given application domain, application-specific word or phrase pronunciations may be required, especially for application-specific proper nouns, such as personal names or location names.

Several guidelines have been reported to define the structure of a pronunciation lexicon, ranging from simple two-column ASCII lexicons providing the mapping between graphemic and phonemic transcriptions, to more general de-facto standards and new standardization attempts, which also handle multiple orthographies and multiple pronunciations.

The ISO-TC37 initiative, which started at LREC 2002, initiated work on a family of ISO standards related to natural language processing (Romary et al., 2006). Currently these standards are available as working drafts of high-level specifications for word segmentation, feature structures, annotations, and also for lexicons. The high-level specifications build on lower-level specifications in the form of language and country codes, data categories, code scripts, and Unicode. Lexicon specifications are covered by the "Lexical Markup Framework" under ISO 24613 (Romary et al., 2006). The same description structure in terms of morphology, syntax and semantics (and translation) applies to monolingual up to multilingual lexicons. Multi-word expressions are given special attention.

Another initiative, the W3C Voice Browser Activity, has recently issued a last-call working draft of the Pronunciation Lexicon Specification (PLS) Version 1.0 (W3C PLS Version 1.0, 2006), which is expected to be submitted soon as a W3C candidate recommendation.
The PLS document was designed to enable interoperable specification of pronunciation information for both ASR and TTS engines within voice browsing applications. The mark-up language allows one or more pronunciations for a word or phrase to be specified using a standard pronunciation alphabet or, if necessary, using vendor-specific alphabets. Pronunciations are grouped together into a PLS document, which may be referenced from other markup languages, such as the Speech Recognition Grammar Specification (SRGS) and the Speech Synthesis Markup Language (SSML). The Pronunciation Lexicon Markup Language, based on PLS, is designed to allow open, portable specification of pronunciation information for speech recognition and speech synthesis engines. The language is intended to be easy to use by developers while supporting the accurate specification of pronunciation information for international use.

The LC-STAR project consortium published another set of recommendations for speech technology lexicons, with an emphasis on application in machine translation, speech recognition and speech synthesis (Shammass & van den Heuvel, 2004; Fersee et al., 2004). A Slovenian lexicon, produced at the University of Maribor, has been built within the scope of the project (Verdonik et al., 2004). Compared to the LC-STAR lexicon specifications, the current version of PLS lacks description specifications for more complex features, such as morphological, syntactic, and semantic features of lexical entries.

In Slovenian, lexical stress can be located on almost any syllable and it obeys hardly any rules. The stressed syllable in Slovenian may form the ultimate, the penultimate or the preantepenultimate syllable of a polysyllabic word. Speakers of Slovenian have to learn lexical stress positions along with learning the language.
As a consequence, a pronunciation lexicon that indicates lexical stress positions for as many Slovenian words as possible is crucial for the development of speech technology applications and linguistic research. Such a lexicon can be used either in its full-blown form or as training material for machine learning techniques aimed at automatically predicting word pronunciations.

Several attempts towards pronunciation lexicon construction for Slovenian have been reported so far (Derlic & Kačič, 1997; Gros & Mihelič, 1999; Gros et al., 2001; Šef et al., 2002; Verdonik et al., 2002; Mihelič et al., 2003). However, none of them has used the full lemma set as given in the Dictionary of Standard Slovenian (SSKJ) (SSKJ, 1991).

The paper describes the construction of a comprehensive reference pronunciation lexicon for Slovenian based on two sources: the information from the SSKJ and another list of the most frequent inflected word forms, which has been derived by an analysis of contemporary Slovenian text corpora.

2 The SI-PRON Lexicon

2.1 SI-PRON Wordlist

The work on designing a new pronunciation lexicon begins with the selection of words, multi-word expressions or phrases that will be represented in the lexicon. Several word-list selection procedures are known (Ziegenhain, 2003).

The construction of the SI-PRON lexicon started with the complete lemma word list of 93,154 entries from the SSKJ provided by the Fran Ramovš Institute of the Slovenian Language, furnished with basic lexical stress information on the stressed vowels and pronunciation exceptions. The complete word pronunciations still had to be determined.

In order to further expand the SI-PRON word list, we are augmenting the SSKJ lemma descriptions with part-of-speech information and declension/conjugation categories (Toporišič, 1991), specifying the inflectional paradigms of the lemmas. Irregular inflected word forms are processed separately.
Using automatic procedures, we are fully expanding the lemmas into inflected word forms. So far, over 1 million lexemes containing lexical stress information have been derived.

Since SSKJ contains many words derived from literary texts that are not so common in everyday situations, we decided to extend the SI-PRON pronunciation lexicon with a list of the 50,000 most frequent inflected word forms whose lemmas are not covered by the SSKJ word list. This additional word list has been derived from a statistical analysis of a contemporary Slovenian text corpus. The corpus comprising over 3 million Slovenian words was composed mainly of fiction and mainstream Slovenian newspaper texts: Delo, Večer, and the former Slovenec. After tokenization and the elimination of numerals, named entities, acronyms, and abbreviations, the remaining text corpus included over 3 million tokens. Acronyms, abbreviations, and named entities were stored in separate word lists.

A statistical analysis performed on the text corpus showed that the roughly 50,000 most frequent words accounted for nearly 95% of all non-SSKJ words used in the text corpus (Gros & Mihelič, 1999). These words form the main additional word list. They were equipped with part-of-speech tags indicating the part-of-speech function of the words in the text corpus.

2.2 Collocations and Multi-word Expressions

The identification of collocations, i.e. current combinations of words as they appear in context, can considerably increase the naturalness of synthetic speech. In human speech, collocations act as prosodic units and are subject to a higher degree of reduction and internal coarticulation than they would be if they were ordinary, separate words. We have chosen a lexical approach for handling collocations. The most common collocations or multi-word expressions, reflexive verbs included, are stored in a separate pronunciation lexicon.
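The frequency-based selection of the additional word list in Section 2.1 can be illustrated with a minimal Python sketch. It is not the procedure actually used for SI-PRON: the function name, the toy tokens and the use of a plain set of covered word forms (instead of a lemma lookup against SSKJ) are assumptions made for illustration only.

```python
from collections import Counter


def select_frequent_forms(tokens, covered_forms, target=0.95):
    """Pick the most frequent uncovered word forms until `target` coverage is reached.

    `covered_forms` is a hypothetical stand-in for "word forms whose lemmas
    are already in the dictionary"; a real system would need lemmatization.
    Coverage is measured over the uncovered tokens only.
    """
    counts = Counter(t for t in tokens if t not in covered_forms)
    total = sum(counts.values())
    selected, covered = [], 0
    for form, freq in counts.most_common():
        if total == 0 or covered / total >= target:
            break
        selected.append(form)
        covered += freq
    coverage = covered / total if total else 0.0
    return selected, coverage


if __name__ == "__main__":
    # Toy corpus: placeholder tokens, not real corpus data.
    corpus = ["a"] * 6 + ["b"] * 3 + ["c"]
    forms, coverage = select_frequent_forms(corpus, covered_forms=set(), target=0.9)
    print(forms, coverage)
```

On a real corpus the same loop would return the size of the word list needed for a chosen coverage level, which is the kind of figure reported above (about 50,000 forms for nearly 95%).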
3 Phonetic Transcriptions

We have developed a tool to automatically derive word pronunciations for the SSKJ inflected words by looking up their stem pronunciation and appending that of the correct inflection from inflectional paradigms and morphological rules of Slovenian (Toporišič, 1991). Therefore, the pronunciation of lexemes has been derived automatically for the SSKJ and SSKJ inflected word lists (about 2,500 entries, mainly words of foreign origin that do not obey the general Slovenian pronunciation rules, have been manually transcribed), and semi-automatically for the remaining part of the word list. Automatic lexical stress assignment and automatic grapheme-to-phoneme conversion rules have been used to process the latter.

3.1 Lexical Stress Assignment

The automatic lexical stress assignment algorithm that we applied to unseen words is to a large extent determined by (un)stressable affixes, prefixes, and suffixes of morphs and is based upon observations by linguists (Toporišič, 1991). For words that do not belong to these categories, the most probable stressed syllable is predicted using the results from a statistical analysis of stress position depending on the number of syllables within a word (Gros & Mihelič, 1999).

3.2 Grapheme-to-Phoneme Rules

Context-free grapheme-to-allophone rules from the Proteus standard words rule set (Žganec Gros, 2006) translate each grapheme string into a series of allophones. The rules are accessed sequentially until a rule that satisfies the current part of the input string is found. The transformation defined by that rule is then performed, and a pointer is incremented to point at the next unprocessed part of the input string. The procedure is repeated until the whole string has been converted. The context-free rules are rare; they cover one-to-one, two-to-one and one-to-two correspondences.
The vast majority of the rules for grapheme-to-allophone transcription for Standard Slovene are context-sensitive. This means that a grapheme or a string of graphemes is transcribed differently according to its phonetic environment. In particular, all rules for determining which allophone of a certain phoneme is to be used in a phonetic sequence are context-dependent.

Each context-sensitive rule consists of four parts: the left context, the string to be transcribed, its right context and the phonetic transcription. A number of writing conventions have been adopted in order to keep the number of rules relatively small and readable. The left and the right context may contain code characters describing larger phonetic sets, e.g.: '#' stands for vowels, '$' for consonants, '_' for white space. The rules for consonants are rather straightforward, while those for vowels must handle vowel length and the variant realizations of the orthographic /e/ and the orthographic /o/ in stressed syllables.

A typical grapheme-to-allophone rule in the Proteus standard words rule set has the following structure:

    left context    grapheme string    right context    allophone string
    $               /er/               _                [@r]
    =               /n/                k                [N]

The first rule says that the word-final /er/ preceded by a consonant is transcribed as [@r] (e.g. /gaber/ -> [*ga:.b@r]). The second rule implies that any /n/ followed by /k/ is transcribed into [N] ([N] is the allophone of [n] when followed by /k/ or /g/, e.g. in /anka/ -> [*a:N.ka]).

The initial rule set, based on the one produced in 2001 (Gros et al., 2001), was built by taking into account various observations of expert linguists, e.g. (Toporišič, 1991), and other basic rule sets for Slovenian grapheme-to-allophone transcription (Gros & Mihelič, 1999). The initial set of rules has been undergoing continuous refinement ever since, resulting in the 194 rules of the Proteus standard words rule set (Žganec Gros, 2006).
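The four-part rule structure and the sequential matching procedure described above can be sketched in Python as follows. This is an illustrative toy, not the Proteus implementation: the letter classes, the function names, the reading of '=' as "any context" and the fallback of copying an unmatched grapheme unchanged are all assumptions, and only the two example rules from the text are included. Stress, vowel length and syllable markers are ignored.

```python
# Assumed letter classes for the context codes '#' (vowel) and '$' (consonant).
VOWELS = set("aeiou")
CONSONANTS = set("bcčdfghjklmnprsštvzž")

# (left context, grapheme string, right context, allophone string)
RULES = [
    ("$", "er", "_", "@r"),  # word-final "er" after a consonant
    ("=", "n", "k", "N"),    # "n" before "k"; '=' assumed to mean any context
]


def context_matches(code, char):
    """Check one neighbouring character against a context code or literal."""
    if code == "=":
        return True
    if code == "#":
        return char in VOWELS
    if code == "$":
        return char in CONSONANTS
    if code == "_":
        return char == " "
    return char == code


def transcribe(word, rules=RULES):
    """Apply the first matching rule at each position, then advance the pointer."""
    padded = f" {word} "  # spaces let '_' match at word boundaries
    end = len(padded) - 1
    out, i = [], 1
    while i < end:
        for left, graphemes, right, allophones in rules:
            j = i + len(graphemes)
            if (j <= end
                    and padded.startswith(graphemes, i)
                    and context_matches(left, padded[i - 1])
                    and context_matches(right, padded[j])):
                out.append(allophones)
                i = j
                break
        else:
            # No rule matched: copy the grapheme unchanged (assumed fallback).
            out.append(padded[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transcribe("gaber"))  # gab@r
    print(transcribe("anka"))   # aNka
```

Because rules are tried in order, more specific rules would have to precede more general ones in a larger rule set.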
Rules for coarticulatory pronunciation corrections of words according to the words' left and right context are included.

In recent years, telecommunication applications of ASR and TTS have increased in importance, e.g. automatic telephone directory inquiry systems. Names of locations (cities, streets, etc.) and other proper names cannot be mentally reconstructed from the context when listening to the messages, so correct name pronunciation is required. The Proteus standard word rules developed for a standard Slovenian vocabulary do not lead to satisfactory results when applied to names. Therefore, additional 'name-specific' rules were added to the final Proteus standard words rule set, resulting in the Proteus names rule set.

3.3 Transcription Accuracy Experiment

The phonemization errors were determined by comparing the automatic transcription outputs to manually verified pronunciation lexicon transcriptions. A performance test applied to the SI-PRON SSKJ-based word list pronunciation lexicon showed error rates of about 25% in the stress assignment of unknown words and consequently in the phonetic transcription. If stress assignment and the transcriptions of graphemic /e/ and /o/ in stressed syllables were manually verified or known in advance, a transcription success rate of 99.1% was achieved for standard SSKJ words. A closer examination of the mismatches revealed that the majority of the errors could be attributed to inconsistencies in manual labelling during the preparation of the original SSKJ.

As a consequence, we argue that, in order to semi-automatically derive phonetic transcriptions for Slovenian words not covered by the lexicon with a 0.3% error rate, manual validation of the stress position and its type has to be carried out, starting from automatically predicted stress positions. The rest can be performed automatically by applying our upgraded grapheme-to-phoneme conversion rule set.
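The comparison of automatic and manually verified transcriptions in Section 3.3 can be sketched as below. The text does not state whether errors were counted per word or per phoneme; this sketch assumes a word-level count. The function name and the deliberately wrong automatic output for "anka" are hypothetical.

```python
def transcription_error_rate(automatic, reference):
    """Return (error rate, mismatched words) over words present in both lexicons.

    Both arguments map an orthographic word to a single transcription string.
    A word counts as an error if the two strings differ at all (assumed metric).
    """
    shared = [word for word in reference if word in automatic]
    if not shared:
        raise ValueError("the two lexicons have no words in common")
    mismatches = [word for word in shared if automatic[word] != reference[word]]
    return len(mismatches) / len(shared), mismatches


if __name__ == "__main__":
    # Reference transcriptions are the two examples quoted in Section 3.2.
    reference = {"gaber": "*ga:.b@r", "anka": "*a:N.ka"}
    # Hypothetical automatic output with one error, for illustration only.
    automatic = {"gaber": "*ga:.b@r", "anka": "*a:n.ka"}
    rate, wrong = transcription_error_rate(automatic, reference)
    print(rate, wrong)  # 0.5 ['anka']
```

Returning the mismatched words alongside the rate makes it easier to inspect whether an error stems from the rules or from inconsistent reference labelling, as discussed above.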
4 SI-PRON Format

The SI-PRON lexicon format complies with the Pronunciation Lexicon Specification (PLS) Version 1.0, a W3C Voice Browser Activity working draft of a syntax specification for pronunciation lexicons (W3C PLS Version 1.0, 2006). This lexicon specification has been recommended for use by speech recognition and speech synthesis engines in voice browser applications.

[Figure 1 listing: the XML markup was lost in extraction; the surviving text reads "dober nd/o:-b@r".]
Figure 1. An example of a simple lexicon file with a single lexeme within SI-PRON.

The <lexeme> element represents a lexical entry and may include multiple orthographies and multiple pronunciation information. Fig. 1 shows a simple lexicon file with a single lexeme within SI-PRON.

In the Pronunciation Lexicon Specification, the pronunciation alphabet is specified by the alphabet attribute of the <lexicon> element. We are using the "x-sampa-SI-reduced" phonetic alphabet, a subset of the X-SAMPA set as defined for Slovenian (Zemljak et al., 2002), augmented with additional markers for Slovenian lexical stress accents (acute, circumflex, and grave) and tonemic accents (tonemic acute and tonemic circumflex). Both primary and secondary stress positions are marked.

The <alias> element is used to provide the pronunciation of an acronym or an abbreviation in terms of an expanded orthographic representation.

4.1 Homographs

Homographs, i.e. words with the same spelling but different pronunciations, can be treated in two ways. If we do not want to distinguish between the two words, we can represent them as alternate pronunciations within the same <lexeme> element. In the opposite case, two different <lexeme> elements need to be used. In both cases the application that makes use of the lexicon will not be able to decide when to apply the first or the second transcription unless additional information, such as context-specific attributes or part-of-speech information, is provided.
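The lexicon structure described in this section can be sketched with Python's standard xml.etree.ElementTree module. This is a simplified illustration rather than a conforming PLS document: the XML namespace and any further attributes that PLS may require on the root element are omitted, and the helper names are hypothetical. The entry reuses the /gaber/ transcription quoted in Section 3.2.

```python
import xml.etree.ElementTree as ET


def build_lexicon(entries, alphabet="x-sampa-SI-reduced"):
    """Build a PLS-style tree from (orthographies, pronunciations) pairs.

    Each pair becomes one lexeme holding one grapheme element per orthography
    and one phoneme element per pronunciation. Homographs that should stay
    distinct are simply passed as two separate pairs.
    """
    lexicon = ET.Element("lexicon", {"alphabet": alphabet})
    for orthographies, pronunciations in entries:
        lexeme = ET.SubElement(lexicon, "lexeme")
        for orthography in orthographies:
            ET.SubElement(lexeme, "grapheme").text = orthography
        for pronunciation in pronunciations:
            ET.SubElement(lexeme, "phoneme").text = pronunciation
    return lexicon


def pronunciations_of(lexicon, word):
    """Collect every pronunciation listed for `word`, across all lexemes."""
    found = []
    for lexeme in lexicon.iter("lexeme"):
        if any(g.text == word for g in lexeme.findall("grapheme")):
            found.extend(p.text for p in lexeme.findall("phoneme"))
    return found


if __name__ == "__main__":
    lexicon = build_lexicon([(["gaber"], ["*ga:.b@r"])])
    print(ET.tostring(lexicon, encoding="unicode"))
    print(pronunciations_of(lexicon, "gaber"))
```

As the lookup helper shows, a consumer that only sees the orthography receives all candidate pronunciations and needs extra information, such as a part-of-speech tag, to choose among them.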
4.2 Multiple Pronunciations

Providing multiple pronunciations for items that share the same orthography and meaning is important for speech recognition lexicons because they provide information on variations of pronunciation within a language. Therefore, for many lexemes, words, and multi-word expressions, multiple standard pronunciations are specified, including those that consider possible coarticulation effects at word boundaries. Multiple pronunciations are indicated by subsequent <phoneme> elements within one <lexeme> element.

Pronunciation preference - extensions needed?

In TTS applications, typically only one pronunciation among the multiple pronunciation possibilities is required. Therefore, to indicate the default pronunciation variation, the prefer attribute can be used in PLS. In SI-PRON, unless marked otherwise, the default pronunciation is the first pronunciation from SSKJ. However, sometimes several pronunciation variations in SSKJ are (almost) equally preferred, whereas the actual preferred pronunciation for the TTS engine may depend on the application.

This is not to be confused with application-specific pronunciations, which can be handled in separate application-specific pronunciation lexica. What we have in mind is that there may exist several almost equally preferred pronunciations for a given grapheme, and the developers would like to have a mechanism that would enable them to systematically choose the preferred one. Typically one of the two almost equally preferred pronunciations yields better rendering of input text if the application requires either overarticulated or fluent pronunciation.

Therefore, we would welcome a new optional attribute in PLS, pron-style, indicating the preferred pronunciation variation of a lexeme with respect to the desired pronunciation style. The two attribute values that would be useful for SI-PRON are "fluent" and "overarticulated".
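How a TTS front end might combine the existing prefer attribute with the proposed pron-style attribute can be sketched as follows. The selection order (requested style first, then the preferred variant, then the first listed variant) is an assumption made for this illustration, as are the class and function names. The strings "ilts" and "iUts" are only the fragments of the "ilc" sequence discussed in this section, not complete word transcriptions, and marking the first variant as preferred is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pronunciation:
    phonemes: str
    prefer: bool = False              # mirrors the PLS prefer attribute
    pron_style: Optional[str] = None  # proposed: "fluent" or "overarticulated"


def choose_pronunciation(variants: List[Pronunciation],
                         style: Optional[str] = None) -> Pronunciation:
    """Pick one variant: matching style, else preferred, else the first listed."""
    if not variants:
        raise ValueError("a lexeme needs at least one pronunciation")
    if style is not None:
        for variant in variants:
            if variant.pron_style == style:
                return variant
    for variant in variants:
        if variant.prefer:
            return variant
    return variants[0]


if __name__ == "__main__":
    # Hypothetical annotation of the two variants of the "ilc" sequence.
    variants = [
        Pronunciation("ilts", prefer=True, pron_style="overarticulated"),
        Pronunciation("iUts", pron_style="fluent"),
    ]
    print(choose_pronunciation(variants).phonemes)            # ilts
    print(choose_pronunciation(variants, "fluent").phonemes)  # iUts
```

With such a mechanism a single master lexicon could serve applications with different speaking styles, since the style is chosen at rendering time rather than baked into separate lexicon copies.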
In addition, the pron-style optional attribute would need to be introduced into SSML as a defined attribute for four SSML elements (the element names are missing from this copy of the text). For the same SSML elements, another optional attribute, emotion, would be useful (e.g. for computer games, where emotion changes occur frequently).

Example: For Slovenian masculine nouns ending with a consonant followed by "ilec", SSKJ often provides one of the following single or multiple pronunciations of the "ilc" sequence within the genitive form of the noun: [iUts]/[ilts], [ilts]/[iUts], [ilts], or [iUts]; examples would be the Slovenian words "nosilca", "krotilca", "darovalca", etc. Many other cases of such pronunciation variations are known for Slovenian, and are marked in SSKJ.

Whenever there are two pronunciation variations in SSKJ, they typically account for an overarticulated (e.g. [ilts]) and a more fluent (e.g. [iUts]) pronunciation variation. The pronunciation order as indicated in SSKJ indicates a slight pronunciation preference in standard usage and should still be indicated by the prefer attribute.

In order to enable high-quality TTS, such pronunciation differentiations should be captured in the text rendering process. This would avoid the confusion of having a multitude of TTS pronunciation lexicons with different variations of the default pronunciation as given by the prefer attribute. Multiple lexicons are impossible to edit synchronously, and the proposed approach would allow us to use one master pronunciation lexicon.

4.3 Multiple Orthographies

Sometimes multiple orthographies of a word share the same meaning and pronunciation. They are represented by subsequent <grapheme> elements within a single <lexeme> element.

4.4 Part-of-Speech Tags

The most recent specification of the PLS focuses on the major features described in the PLS requirements document. Many more complex features, such as those providing morphological, syntactic and semantic information associated with pronunciations, are expected to be introduced in a future revision of the PLS specification.
Therefore, two proprietary elements have been additionally defined for SI-PRON. Multext-East morphosyntactic descriptors for the Slovenian language, as described in (Erjavec, 2004), were used to provide the part-of-speech information of the lexemes, along with the lemmas.

5 SI-PRON Validation

Finally, the SI-PRON lexicon has been subjected to an automatic validation to ensure that the structure of the document is well-formed and conforms to the chosen Document Type Definition (DTD). Additionally, manual validation of both phonemic transcriptions and morphosyntactic descriptions was performed on a subset of the lexicon comprising 5,000 lexical entries. A subset of the LC-STAR lexicon validation criteria was used (Shammass & van den Heuvel, 2004).

A lexicon editing tool with a user-friendly interface has been designed to allow inspecting, editing, browsing and automatic validation of the pronunciation lexicon.

6 Conclusion

Because lexical stress position is free in Slovenian, pronunciation lexica are of crucial importance for the development of speech technology applications and linguistic research for Slovenian. They are not only used for providing application-specific pronunciations or pronunciations of names, but are indispensable in any TTS or ASR system.

The task of constructing a master pronunciation lexicon is very tedious and time-consuming and should not be repeated often. Therefore, a master-lexicon approach, in which many speaking-style pronunciation nuances are captured, is best suited for Slovenian TTS. We propose refined extensions to both PLS and SSML, which are described in section 4 and mainly deal with multiple pronunciations and morphosyntactic descriptions.

Along with Onomastica, SI-PRON presents a valuable language resource for linguistic studies as well as for research and development of speech technologies for Slovenian.
The lexicon is already being used by the Proteus Slovenian text-to-speech synthesis system (Žganec Gros, 2006) and for generating audio samples of the SSKJ word list, which are available at the very end of every SSKJ lexical entry description (SSKJ audio, 2006).

6.1 Acknowledgement

A part of the presented work has been financed as an applied research project by the Slovenian Research Agency under contract No. 5405.

References

[1] Derlic, R., Kačič, Z. (1996). Definition of pronunciation dictionary of names and letter-to-sound rules for Slovene language - project Onomastica. In Proceedings of the 2nd International Workshop on Speech dialog man-machine, Maribor, Slovenia, June 26-27, pp. 153-158.
[2] Erjavec, T. (2004). MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04, Lisbon, Portugal, pp. 1535-1538.
[3] Fersee, H., Hartikainen, E., van den Heuvel, H., Maltese, G., Moreno, A., Shammass, S., Ziegenhain, U. (2004). Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04, Lisbon, Portugal.
[4] Gros, J., Mihelič, F. (1999). Acquisition of an extensive rule set for Slovene grapheme-to-allophone transcription. In Proceedings of the 6th European Conference on Speech Communication and Technology EUROSPEECH'99, Budapest, Hungary, pp. 2075-2078.
[5] Gros, J., Mihelič, F., Pavešić, N., Žganec, M., Mihelič, A., Knez, M., Mercun, A., Skerl, D. (2001). The phonetic SMS reader. In Proceedings of the Text, speech and dialogue 4th international conference, Železná Ruda, Czech Republic, Lecture notes in artificial intelligence, 2166. Berlin: Springer, pp. 334-340.
[6] Mihelič, F., Žganec Gros, J., Dobrišek, S., Žibert, J., Pavešić, N. (2003). Spoken language resources at LUKS of the University of Ljubljana. International Journal on Speech Technologies, Vol. 6, No. 3, pp. 221-232.
[7] PLS-W3C (2006). Pronunciation Lexicon Specification (PLS) Version 1.0, W3C Working Draft 31 January 2006. http://www.w3.org/TR/pronunciation-lexicon/ S4.7.
[8] Romary, L., Francopoulo, G., Monachini, M., Salmon-Alt, S. (2006). Lexical Markup Framework: working to reach a consensual ISO standard on lexicons. To be presented at LREC'06 as a tutorial. Genoa, Italy.
[9] SSKJ audio (2006). Available from http://bos.zrc-sazu.si/sskj.html.
[10] Verdonik, D., Rojc, M., Kačič, Z., Horvat, B. (2002). Zasnova in izgradnja oblikoslovnega in glasovnega slovarja za slovenski knjižni jezik. In Zbornik konference Jezikovne tehnologije'02. Editors: Tomaž Erjavec, Jerneja Gros, Ljubljana, Slovenia, pp. 44-48.
[11] Verdonik, D., Rojc, M., Kačič, Z. (2004). Creating Slovenian language resources for development of speech-to-speech translation components. In Proceedings of the Fourth International Conference on Language Resources and Evaluation LREC'04, Lisbon, Portugal, pp. 1399-1402.
[12] Shammass, S., van den Heuvel, H. (2004). Specification of validation criteria for lexicons for recognition and synthesis, LC-STAR Deliverable D6.1. Available from www.lc-star.com.
[13] SSKJ (1997). Slovar slovenskega knjižnega jezika (The Dictionary of Standard Slovenian). 2nd edition, Ljubljana: DZS.
[14] Šef, T., Gams, M., Škrjanc, M. (2002). Automatic lexical stress assignment of unknown words for highly inflected Slovenian language. In Zbornik 11. mednarodne Elektrotehniške in računalniške konference ERK 2002, Portorož, Slovenija, pp. 247-250 (in Slovenian).
[15] Toporišič, J. (1991). Slovenska slovnica (Slovenian Grammar). Založba Obzorja Maribor.
[16] Zemljak, M., Kačič, Z., Dobrišek, S., Gros, J., Weiss, P. (2002). Računalniški simbolni fonetični zapis slovenskega govora. Slavistična revija, Vol. 50, No. 2, pp. 159-169.
[17] Ziegenhain, U. (2003). Specification of corpora and word lists in 12 languages. LC-STAR Deliverable D1.1. Available from www.lc-star.com.
[18] Žganec Gros, J. (2006). Text-to-speech synthesis for embedded speech user interfaces. In WSEAS Transactions on Communications, No. 4, Vol. 5, pp. 543-548.