HARDWARE IMPLEMENTATION OF LANGUAGE RESOURCES FOR EMBEDDED SYSTEMS Matej Roje, Zdravko Kačič, iztok Kramberger Institute of Electronics, Faculty of Electrical Engineering and Computer Science, Maribor, Slovenia Key words; finite-state machines, finite-state transducers, ptionetic and morphological lexicons, spoken dialogue applications, Atmel flash memory, Atmel microcontroller Abstract: A lot of external natural language resources are used in spoken dialogue systems. These resources present considerable problems because of the needed space and slow lookup-time. It is, therefore, very important that the presentation of external language resources is time and space efficient. It is also very important that new language resources are easily incorporated into the system, without modifying the common algorithms developed for multiple languages. This paper presents the method and results of compiling the large Slovenian phonetic and morphology lexicons (SIflex and SImlex) into corresponding finite-state transducers (FSTs). Representation of large lexicons using finite-state transducers is mainly motivated by considerations of space and time efficiency. In addition the approach of hardware implementation for both large (Slovenian) lexicons is described. We will demonstrate that the structure of the FST is very appropriate for storing in the Atmel AT49BV161 flash memory chip and the lookup algorithm for obtaining any desired information from the FST structure can be efficiently implemented using the Atmel AT90S8515 microcontroller. The described hardware implementation of both Slovenian lexicons can be connected directly to the PC using RS232 above all for development and test purposes and can be used especially in embedded systems which use speech technology. Strojna implementacija jezikovnih virov za uporabo v vdelanih sistemih Ključne besede: končni stroji, končni pretvorniki, fonetični in morfološki leksikoni, Atmel flash pomnilnik, Atmel mikrokrmilnik Izvleček: V govornih sistemih dialoga se uporablja mnogo jezikovnih virov Uporaba obsežnih jezikovnih virov predstavlja velik problem tako zaradi porabljenega pomnilniškega prostora, kot tudi zaradi počasnega dostopanja do željene informacije. Zato je zelo pomembno, da je predstavitev uporabljenih virov časovno in pomnilniško optimalna. Pri večjezičnih sistemih je pomembna tudi preprosta vključitev jezikovnih virov drugih jezikov v sam sistem, ne da bi bilo pri tem potrebno spreminjati skupne algoritme razvite za več jezikov V članku predstavljamo metodo in rezultate prevajanja obsežnega slovenskega fonetičnega in morfološkega leksikona (SIflex in SImlex) v pripadajoča končna pretvornika (FST). Takšna predstavitev je izredno učinkovita tako glede porabe pomnilniškega prostora, kot tudi časa, potrebnega za dostop do informacije. Podrobneje bo predstavljen tudi pristop strojne implementacije obeh Slovenskih leksikonov Struktura končnih pretvornikov je zelo primerna za njihov zapis v "flash" pomnilniško integrirano vezje (Atmel AT49BV161), algoritem pridobivanja informacije iz strukture končnega pretvornika pa lahko enostavno implementiramo z uporabo mikrokrmilnika (Atmel AT90S8515). Opisano strojno implementacijo obeh slovenskih leksikonov lahko priklopimo direktno na PC računalnik preko RS-232 serijskih vrat, predvsem za razvojne namene in testiranje. Posebej zanimiva pa je uporaba predstavitve leksikonov s končnimi stroji in njihove v članku predlagane strojne implementacije. Oba leksikona, predstavljena s končnimi stroji, lahko učinkovito uporabimo v poljubnih vdelanih sistemih, ki uporabljajo govorno tehnologijo. 1. Introduction When using voice, at least speech recognition and text-to-speech synthesis technology should be integrated as significant constituent part of an embedded mobility suite in order to operate most of the common PDA (Personal digital assistance) functions such as e-mail, tasks, calendar, phone numbers and addresses. Both should allow natural way of communication using such devices. On the other hand the development of real models of human language that support research and technology development in language related fields requires a lot of linguistic data - lexicons containing thousands of words. In order to achieve the same language coverage, as in the case of e.g. the English language, such lexicons need to be up to ten times larger in the case of inflectional languages. Having a lot of Slovenian root forms can result in up-to 200 different inflectional forms. The use of such resources can, therefore, represent substantial computational load especially for embedded mobility systems, demanding low energy consumption and the smallest possible implementation. A given spoken language system, which uses fully inflected word forms, performs much worse with highly inflected languages (e.g. Slovenian) than with non or purely inflected languages (e.g. English), where the lexicons used can be much smaller. In general, external language resources (phonetic, morphology lexicons etc.) present a problem regarding memory usage and the time spent on lookup processes. Finite-state machines are already used in many areas of natural language processing. Their use from the computational point of view is mainly motivated by considerations of space and time efficiency. Linguistically, finite-state machines allow an easier description of most of the relevant local phenomena in the language /1/. They also provide compact representation of the specific external language resources needed for knowledge representation in the automatic text-to-speech synthesis systems. These features of finite-state machines are of major importance especially, when dealing with spoken dialogue systems. in the following sections an approach for compiling such lexicons into finite-state transducers is first presented that represent their time and space optimal representation. The effect of using finite-state transducers for the representation of external natural language resources means a greater reduction in the memory usage required by the lexicons, and an optimal access time (required for obtaining information) which is independent of the lexicons' sizes. In the following sections the whole compilation process into finite-state transducers is presented plus the results obtained for the described Slovenian lexicons (SIflex and SImlex). The lexicon representation appropriate for hardware implementation is then discussed in more detail. In conclusion the hardware implementation is presented of both lexicons (FST's) for use in embedded system. 2. Finite-state automata and finite-state transducers 2.1. Finite-state automata (FSA) Finite-state automata (PSA) /2/ can be seen simply as an oriented graph with labels on each arc. Their fundamental theoretical properties make FSAs very flexible, powerful and efficient. FSAs can be seen as defining a class of graphs and also as defining languages. 2.1.1. Definition A finite-state automaton A is a 5-tuple {E, Q, i, F, E) where E is a finite set called the alphabet, Q is a finite set of states, ieQ is the initial state, FcQ is the set of final states, and EoOxfi^u {e)jxQ is the set of edges. FSAs have been shown to be closed under union, Kleen star, concatenation, intersection and complementation, thus allowing for natural and flexible descriptions. In addition to their flexibility due to their closure properties, FSAs can also be turned into canonical forms that allow for optimal time and space efficiency /3/. 2.2. Finite-state transducer (FST) FSTs can be interpreted as defining a class of graphs, a class of relations on strings, or a class of transductions on strings/1/. On the first interpretation, an FST can be seen as an FSA, in which each arc is labeled by a pair of symbols rather than by a single symbol. Definition A finite-state transducer T is a 6-tuple (Ei, E2, 0, /, F, E) such that: Si is a finite alphabet, namely the input alphabet E2 is a finite alphabet, namely the output alphabet Q is a finite set of states /eO is the initial state FcQ is the set of final states £cQxZ*ix2;*2xQ is the set of edges As with FSAs, FSTs are also powerful because of the various closure and algorithmic properties. 3. Use of FSMs for time and space optimal Lexicon representation In general, when representing lexicons by automata, many entries share the same codes (strings, representing some piece of information). The number of codes is then small compared to the number of entries. Newly developed lexicons are more and more accurate and the number of codes can increase considerably. The increase in number of codes also increases the smallest possible size of such lexicons. During the construction of the automaton one needs to distinguish different codes, therefore space required for an efficient hashing of the codes can also become costly. Available lexicons that were used in this experiment suggest that the representation by automata would be less appropriate. Since morphological and phonetic lexicons can be viewed as a list of pairs of strings, their representation using finite-state transducers seems to be very appropriate. Representation of lexicons using finite-state transducers on the other hand also provides reverse look-up capability. The methods used in the compilation of large scale lexicons into finite-state transducers (FST) assume that the lexicons are given as large list of strings and not as a set of rules as considered by Kaplan and Kay for instance /1/. In Fig. 1 some items from Slovenian phonetic (SIflex) and morphology lexicon (SImlex) are shown. Both lexicons were compiled into corresponding finite-state transducers, using proprietary toolkit fsmHAL. It consists of a large set of various algorithms and tools for FSM manipulation and is written is C++ program lanugage. During the compilation process the following algorithms were used: union, deter-minization, and minimization {Aho, 1974; Watson, 1995), (Mohri,1995). mod/el m 0-d /e: I mod/ela m O - d /e: -1 a mod/elu m O - d /e: -1 u mod/elom mO-d/e:-l mod/eloma m O - d /e: -1 O - m a mod/elih mO-d/e:-l mod/eli mO-d/e:-l mod/ele mO-d/e:-l O m I X a) Figure 1: model mod/el.N:cmsn:cmsa modela mod/el.N:cmsg:cmdn:cmda mocelu mod/el.N:cmsd:cmsl modelom mod/el.N:cmsi:cmpd modeloma mod/el.N:cmdd:cmdi modelih mod/el.N:cmdl:cmpl modeli mod/el.N:cmpn:cmpi modele mod/el.N:cmpa b) Slovenian phonetic (a) and morphology lexicons (b). Slovenian morphology lexicon (SImlex) is coded according to Sampa /5/ and Multext specifications /6/. Figure 2: Part of Slovenian phonetic lexicon (SIflex) represented as FST. The representation using finite-state transducers was performed for the SIflex and Slmlex Slovenian lexicons. The starting size for SIflex was 1.8 MB (60.000 items) and 1.4 MB for Slmlex (40.000 items). The final size achieved using the presented algorithms was 352 kB for SIflex and 662 kB for Slmlex (Table 1) /4/. Representation of large lexicons using finite-state transducers is mainly motivated by considerations of space and time efficiency. For both lexicons a great reduction in size and optimal access time was achieved. Using such representation the look-up time is optimal, since it depends only on the length of the input word and not on the size of the lexicon. byte 0 byte 1 byte 2 byte 3 byte 4 byte 5 ftxil k ' llllllll r ! <1 fl 1 0 t^ I : aix; luijisnfi; j h Figure 3: Lexicons FST byte representation. (jump to the next 6-byte sequence). Using such approach, states are only indirectly marked and are actually defined with transition sequences. All the transitions not having set the stop bit belong to the same state. Next state transition start from the byte sequence, that has a stop bit set. FST, FST2 Number of states 69.498 90.613 Number of transitions 90.801 130.839 Size of bin file 252 kB 662 kB Table 1: The final finite-state transducers representing Slovenian phonetic (FSTi) (60.000 Items) and Slovenian morphology lexicon (FST2) (40.000 items). 4. Lexicons FST byte representation Finite-state transducers representing lexicons are actually finite-state automata that have transitions labeled with two symbols. One of the symbols represents input, the other output. Therefore they translate strings. Since FST's of large-scale lexicons can be quite huge (lots of states and transitions) their implementation is not trivial. It is very important to use 'every bit' in their binary representation. The information in the final FST binary file is organized into sequences of 6 bytes. Every such byte sequence describes information of the one state transition (Fig. 3). The information representing one transition is coded using 6 bytes. The choice depends on the maximum value of a particular data type and final FST size of the lexicons. The FST input and output alphabets for Slovenian lexicons (SIflex and Slmlex) (ortographic characters, SAMPA phonetic symbols, some punctuation symbols) can be coded using eight bits (one octet) (first byte for input alphabet symbol and the second byte for output alphabet symbol). The third byte serves for flags (final bit, stop bit, next bit) and other three bytes are used for calculation of the next state 5. Hardware implementation of the FST Lexicons The described FST representation of the lexicons is very appropriate also for implementation using the flash memory chip (Atme! AT49BV161 T). AT49BV161T is a 16-mega-bit (1 Mx16/2Mx8) 3 Volt Flash Memory and is organised as 1.048.576 words of 16 bits each, or 2.097.152 bytes of 8 bits each. Thex16 data appears on 1/00-1/15, thexB data appears on I/00-I/07. This device can be read or reprogrammed using a single 2.65V power supply, making it ideally for in-system programming. In the AT49BV/LV161 (T) configuration, the BYTE pin controls wheatherthe device data I/O pins operate in the byte or word configuration. In ourapproach the BYTE pin is set mtnil InitTer address 5 latch Y-decoder Y-gating X-decoder iiliiliiiii:iiiiiiiiiii^ MBIH M