Informatica 28 (2004) 159–165 159 Using Finite-State Transducer Theory for Representation of Very Large Scale Lexicons Matej Rojc, Zdravko Kaèiè Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, Maribor, Slovenia Phone: +386 2 220 7223, Fax: +386 2 2511 178 matej.rojc@uni-mb.si, kacic@uni-mb.si Keywords: finite-state transducers, natural language resources, multilingual text-to-speech synthesis, morphology, lexicons Received: September 17, 2002 In multilingual text-to-speech synthesis systems, many external extensive natural language resources are used, especially in the text processing part. Therefore it is very important that representation of these resources is time and space efficient. It is also very important that language resources for new languages can be easily incorporated into the system, without modifying the common algorithms developed for multiple languages. In this regard the use of large external language resources represents an important problem because of the needed space and slow lookup-time. In the paper a method and results of compiling large lexicons, with an example of compiling German phonetic and morphology lexicons (CISLEX), into corresponding finite-state transducers (FSTs) are presented. Each lexicon consisted of about 300.000 words. Representation of large lexicons using finite-state transducers is mainly motivated by considerations of space and time efficiency. For both lexicons a great reduction in size and optimal access time was achieved. The starting size for German phonetic lexicon was 12.53 MB and 18.49 MB for morphology lexicon. The final size of the corresponding FST was only 2.78 MB for the phonetic lexicon and 6.33 MB for the morphology lexicon. At the same time the look-up time is optimal, since it depends only on the length of the input word and not on the size of the lexicon. Using such representation, the integration of lexicons for new languages into the multilingual TTS system is easy and does not require any changes of algorithms that use such lexicons. 1 Motivation Finite-state machines are already used in many areas of and the optimal access time (required for obtaining in-natural language processing. From the computational formation) that is independent from the size of the lexi­point of view, their use is mainly motivated by conside-cons. The whole compilation process into finite-state rations of space and time efficiency. Linguistically, the transducers will be presented and at the end results finite-state machines [6][8][10][11] allow one to describe obtained for the German lexicons described. easily most of the relevant local phenomena in the lan­guage. They provide also compact representation of ex­ 2 Finite-state automata and ternal language specific resources needed for knowledge representation in the automatic text-to-speech synthesis finite-state transducers systems. These features of finite-state machines are of major importance especially when we are dealing with 2.1 Finite-state automata (FSA) multilingual text processing in text-to-speech synthesis Finite-state automata (FSA) [6] can be seen simply as an systems (TTS systems). oriented graph with labels on each arc. Fundamental theoretical properties make FSAs very flexible, powerful In multilingual text-processing module for the multilin­and efficient. FSAs can be seen as defining a class of gual TTS system, external natural language resources graphs and also as defining languages. (e.g., phonetic, morphology lexicons etc.) represent an important problem, regarding the memory usage and Definition time needed for lookup process. A finite-state automaton A is a 5-tuple ( .,Q,i, F, E ) In the following sections we are presenting an approach where . is a finite set called the alphabet, Q is a finite for compiling such lexicons into finite-state transducers set of states, i .Q is the initial state, F .Q is the set that represent their time and space optimal representa­ of final states and E . ({}xQ is the set of Qx .. .) tion. The effect of using finite-state transducers for repre­ edges. sentation of external natural language resources is great reduction of the memory usage required by the lexicons FSAs have been shown to be closed under union, Kleen star, concatenation, intersection and complementation, thus allowing for natural and flexible descriptions. In addition to their flexibility due to their closure properties, FSAs can also be turned into canonical forms that allow for optimal time and space efficiency. 2.2 Finite-state transducer (FST) FSTs [9] can be interpreted as defining a class of graphs, a class of relations on strings, or a class of transductions on strings. On the first interpretation, an FST can be seen as an FSA, in which each arc is labelled by a pair of symbols rather than by a single symbol. Definition A finite-state transducer T is a 6-tuple ( .1 , .2,Q, i, F, E ) such that: • .1 is a finite alphabet, namely the input alphabet, • .2 is a finite alphabet, namely the output alphabet, • Q is a finite set of states, • i .Q is the initial state, • F .Q is the set of final states, • E .Qx.1 * x.* 2 xQ is the set of edges. As with FSAs, FSTs are also powerful because of the various closure and algorithmic properties. In the paper we adhere to the following conventions when describing an FST: final states are depicted by bold circle; . repre­sents the empty string; the initial state (labelled 0) is the leftmost state appearing in the figure. 2.3 Use of FSMs for time and space opti­mal Lexicon representation When representing lexicons by automata, in general, many entries share the same codes (strings, representing some piece of information). The number of codes is then small compared to the number of entries. Newly develo­ped lexicons are more and more accurate and the number of codes can increase considerably. The increase in number of codes also increases the smallest possible size of such lexicons. During the construction of the automaton one needs to distinguish different codes, therefore space required for an efficient hashing of the codes can also become costly. Available lexicons that were used in this experiment suggest that the representation by automata would be less appropriate. Since morphological and phonetic lexi­cons can be viewed as a list of pairs of strings, their rep­resentation using finite-state transducers [10] seems to be very appropriate. The results given at the end of this paper also confirm this assumption. Representation of lexicons using finite-state transducers on the other hand also provides reverse look-up capability. In the multilingual TTS system morphological and pho­netic lexicons represent part of the external natural lan-M. Rojc et al. guage dependent resources used by multilingual text-processing engine. It is desired that language indepen­dent modules for morphology analysis and grapheme-to­phoneme conversion inside the multilingual text-proce­ssing engine use common algorithms for multiple lan­guages. This is possible when external natural language dependent resources are represented as finite-state trans­ducers. Integration of new lexicons for new languages in the whole TTS system is then very easy, since only compilation procedure (off-line) has to be performed. 3 Compilation process of large scale lexicons into finite-state transdu­ cers 3.1 Lexicons preparation The methods used in the compilation of large scale lexi­cons into finite-state transducers (FST) assume that the lexicons are given as large lists of strings and not as a set of rules as considered by Mehryar Mohri [3] for in­stance. Obviously morphological and phonetic lexicons can be viewed as a list of pairs of strings and their repre­sentation using finite-state transducers seems to be very appropriate. In Fig. 1 some items from German phonetic and morphology lexicons are shown. As with automata, direct construction of the sequential transducer representing a large-scale lexicon, is not possible because the construction leads to a blow up for a large number of entries. To avoid this, splitting the lexicon into several parts is performed. Then the construction of the corresponding sequential transducers including minimization operation follows. Using union, determinization, and minimization operations, only one transducer representing the whole lexicon is obtained at the end (Fig.2). 3.2 Determinization of finite-state trans­ducers The algorithm used is close to the powerset construction used for determinizing automata [3]. The main diffe­rence is that here one needs to provide states of the sets with strings. These strings correspond to a delay in the emission that is due to the fact that outputs correspond­ing to a given input can be different. Therefore only the longest common prefix of outputs can be kept and sub­sets represent actually pairs (state, string). The pseudo­code for the algorithm to determinize a transducer T1 is given in Fig. 3. "Abte "E p - t @ "Abten "E p - t @ n "Abtissin E p - t "I - s I n "Abtissinnen E p - t "I - s I - n @ n "Ackern "E - k 6 n "Aderchen "E: - d 6 - C @ n "Aderchens "E: - d 6 - C @ n s ....... a) "Abte abt.mask(NS1,NP12)#0:amM:gmM:nmM "Abten abt.mask(NS1,NP12)#0:dmM "Abtissin "Abtissin.fem(NS0,NP5)#0:aeF:deF:geF:neF "Abtissinnen "Abtissin.fem(NS0,NP5)#0:amF:dmF:gmF:nmF "Acker acker.mask(NS2,NP11)#0:amM:gmM:nmM "Aderchen "Aderchen.neut(NS2,NP0)#0:aeN:amN:deN:dmN:gm N:neN:nmN "Aderchens "Aderchen.neut(NS2,NP0)#0:geN ......... b) Figure 1: German phonetic (a) and morphology lexicons (b). German morphology lexicon is coded according to CISLEX specification [5]. German morphologic lexicon 300.000 items number of sub-lexicons 1.step 2.step 3.step Figure 2: Lexicons preparation. At each step a new state q2 is considered as can be seen in line 5. State q2 is a final state only if it contains a pair (q,w), where q is final in T1. String w is the final output at the state q2. In line 10, each input label a of the transitions leaving the states of the subset q2 is considered. A transition is constructed from state q2 to state .2(q2,a) with output .2(q2,a). Output .2(q2,a) represents the longest common prefix of the output labels of all the transitions leaving the states q of q2 with input label a, when left concatenated with their delayed string w. State .2(q2,a) is the subset made of pairs (q’,w’). Here q’ is a state reached by one of the transitions with input label a in T1 and w’ = [.2(q2,a)]­1w.1(q,a,q’) is the delayed string that could not be out­puted earlier in the algorithm. String [.2(q2,a)]­1w.1(q,a,q’) is a well defined string since [.2(q2,a)] is a prefix of all w.1(q,a,q’) as can be seen from line 10. Informatica 28 (2004) 159–165 161 In Fig. 5 we can see the result of using the determiniza­tion algorithm on transducer from Fig. 4 (obtained using union operation). In this example the number of states of the determinized transducer T2 is already less than in T1. Experiments showed that this method is very efficient in constructing transducers for representation of large lexi­cons. The disadvantage of this algorithm is that the out­puts are pushed toward final states, which creates a long delay in emission. But fortunately sequential transducers can be minimized as we will show in the next section. An important characteristic of the minimization algorithm is that it pushes back outputs as much as possible toward the initial state. In such a way we can eliminate the problem just mentioned. Determinize_transducer( T1,T2) 1 F2 ‹ O 2 i2 ‹ U{(i,.)} i.I1 3 Q2 ‹ { i2} 4 while Q . O 5 do q2 ‹ head[Q] 6 if (there exists (q, w).q2 such that q.F1 ) 7 then F2 ‹ F2 .{q2 } 8.2 (q2 ) ‹ w 9 for each a such that (q,w) .q2 and .1(q,a) de­fined do .2 (q2 , a) 10 . . ‹ . w · . .1 (q, a, q') (q,a).J1 (a).. q'..1 (q,w) .. .2 (q2 , a)11 -1 ‹ U{(q', [. (q , a)] w ·.(q, a, q'))} 22 1 (q,w,q').J (a) 2 12 if (.2 (q2 , a) is a new state) 13 then Enqueue(Q,.2 (q2 ,a)) 14 Dequeue(Q) Figure 3: Pseudocode for determinization algorithm [3]. “A/”E “A/”E b/p t/- e/t ./@ b/p t/-e/t n/@ ./n“A/”E Figure 4: Union operation done on a few word items in the German phonetic lexicon (T1). ./@ “A/”E b/p t/- e/t ./n n/@ “A/”E Figure 5: Finite-state transducers T2 (above) and T3 (below) obtained after performing determinization and prefixation algorithms on finite-state transducer showed in Fig. 4. 3.3 Minimization of finite-state transdu­cers Sequential transducers allow very fast look-up. But transducers can also be minimized. Minimization algo­rithms help to make them also space efficient [1][2][4][7]. The whole minimization procedure for se­quential transducers consists actually of two different algorithms. One is algorithm for computation of the pre­fix of a non-deterministic automaton [4] and the other is classical algorithm for minimization of automata [1][2]. In this section we will present the algorithm for compu­tation of the prefix, as it is independent of the concept of sequential transducers and will describe the entire algo­rithm that allows derivation of minimal sequential trans­ducers. In the algorithm described, we use the following nota­tion: • GT the transpose of G (the automaton obtained from G by reversing each transition); • Trans[u] the set of transitions leaving u .V; • TransT[u] the set of transitions entering u .V; • t.v the vertex reached by t and t.l its label, for any transition t in Trans[u] (resp. in TransT[u]), u .V; • out-degree[u] the number of edges leaving u .V; • in-degree[u] the number of edges entering u .V; • E the set of edges of G. 1. First we compute .u, the greatest common prefixes of all its leaving transitions: .u ‹( . t.l Xt.v ) .( . t.l) if u .F, t .[] .[] Trans u t Trans u t .v.scc t.v.scc .u ‹. else; 2. Then if .u .., we can make a change of variables: Yu ‹.uXu. This second step is equivalent to storing the value .u and solving the system modified by the following operations: M. Rojc et al. -1 .t .Trans[u], t.l ‹.u t.l, [] ‹t.l.u . .t .TransT u , t.l The number of times these two operations are performed can be limited by storing in array N the number of empty labels leaving each state u of the strongly connected component scc. While N[u] .0, there is no use to perform these operations as the value of .u is .. Also in the case that N[u] = 0 right after the computation of .u, the .u will remain equal to ., as changes of variables will only affect suffixes of the transitions leaving u. This in­formation can be stored using an array F, in order to avoid performing step 1 in such situations or when u is a final state. In the algorithm we use a queue Q containing the set of states u with N[u] = F[u] = 0 for which the two operations above need to be performed, and an array INQ indicating for each state u whether it is in queue Q. The above operations are started by initializing N and F to 0 for all states in scc, and by enqueuing in queue Q an arbitrarily chosen state u of the strongly connected com­ponent scc. Each time the transition of a state v of TransT[u] is modified, v is added to Q if N[v] = F[u] = 0. The property of SCC’s (strongly connected compo­nent) and the initialization of N and F assure that each state of scc will be enqued at least once. Steps 1 and 2 are operated until queue Q = Ø. This must happen as, except for the first time, step 1 is performed for a state u if N[u] = 0. After the computation of the greatest co­mmon prefix we can have N[u] = 0 and then u will never be enqueued again, or N[u] .0 and then a new non empty factor .u of P(u) has been identified. It is ob­viously then, that each state u is enqueued at most (|P(u)|+2) times in Q, and after at most (|Pmax|+2) steps we have Q = Ø. Prefix_Computation(G) 1 for each u .V(GSCC) 2 do for each v .SCC[u] 3 do N[v] ‹INQ[v] ‹ F[v] ‹0 4 Q ‹v 5 INQ[v] ‹1 6 while Q .0/ 7 do v ‹ head[Q] 8 Dequeue(Q) 9 INQ[v] ‹0 10 p ‹ GCP(G,v) 11 for each t .TransT[v] 12 do if(p ..) 13 then if(t.v . SCC[v] and N[t.v] > 0 and t.l = .and F[t.v] = 0) 14 then N[t.v] ‹ N[t.v] – 1 15 t.l ‹ t.l p 16 if(N[t.v] = 0 and INQ[t.v] = 0 and F[t.v] = 0) 17 then Enqueue(Q,t.v) 18 INQ[t.v] = 1 Figure 6: Pseudocode for the prefixation algorithm on finite-state transducers [4]. Once Q = Ø, it is easy to see that the system of equa­tions has a trivial solution: .u .scc, Xu = .. It has a unique solution. Therefore, the system is resolved. Con­catenating the factors .u involved in the changes of vari­ables corresponding to the state u gives the value of P(u). The set of operations (2) are obviously equivalent to multiplying the label of each transition joining the states u and v, (v .scc), at right by P(v) and at left by [P(u)]-1 if u is in scc. Thus the transformations described above do modify the transitions leaving or entering states of scc as desired. The above pseudocode gives an Informatica 28 (2004) 159–165 163 4 Results For the lexicons compilation the German large scale phonetic and morphology lexicons (CISLEX) [5] were used. In compilation process a large set of proprietary programs written in C++ that perform efficiently many operations on finite-state transducers and finite-state automata including determinization, minimization, union, intersection, compaction, prefixation, local extension and others were used. NUMBER OF algorithm that computes p(G) from G. In the algorithm, V(GSCC) represents the set of states of the component graph of G. For each u in V(GSCC), SCC[u] stands for the strongly connected component corresponding to u. The these transitions by dividing them at left by p and counts and stores in N[u] the number of empty transitions. If a)N[u] = 0 after the computation of the greatest common prefix or if u is a final state, the F[u] becomes the value NUMBER OF function GCP(G,u) called in the algorithm is such that it returns p, which is the greatest common prefix of all transitions leaving u (p = .if u .F). It replaces each of 1. The computation of the greatest common prefix of n (n>1) words requires at most (|p|+1).(n-1) comparisons, where p is the result of this computation [4]. This opera­tion consists of comparing the letters of the first word to those of the (n-1) others until a mismatch or end of a word occurs. The same comparisons allow to obtain the division at left by p and the number of empty transitions. In case only one transition leaves v, the computation of the greatest common prefix can be assumed to be in O(1). Therefore, the cost of a call of the function GCP for a state v (.V - F) is O((|p|+1)(out-degree(v)-1)+1). Here p is the greatest common prefix of the transitions leaving v. Figure 8: Achieved reduction of the number of states obtained in the first step of compilation process – 10 ran­ domly chosen transducers (a: phonetic lexicon. b: morphology lexicon.) t/. e/. Figure 7: Finite-state transducer T4 obtained using minimization algorithm in the sense of automata from T3. Given a sequential transducer T, the application of the prefix computation algorithm [4] to the output automaton of T has no effect on the states of T or on its transition function. Only the output function . of T is changed. A minimal ST, that computes the same function as T, can be obtained by applying the prefix computation algorithm to the output automaton of T, and also the minimization algorithm in the sense of automata, to the resulting trans­ducer [1][2]. Fig. 5 (transducer T3 ) shows the result ob­tained after performing prefix computation algorithm on sequential transducer T2 in particular case. The applica­tion of the prefix computation algorithm on T2 leads to the transducer T3, which computes the same function. Only outputs differ from those of T2. In Fig. 7 the final obtained transducer is presented using minimization al­gorithm in the sense of automata. n/n During construction of corresponding finite-state trans­ ducers, the following algorithms were used: union, deter­ minization, prefix computation and classical minimiza­tion algorithms of finite-state automaton (Aho, Sethi, and Ullman; Hopcroft; Watson) [1][2][12]. Prefix computa­tion algorithm was used before minimization algorithms. It pushes the output labels towards the initial state as much as possible. All lexicons represent part of the language dependent external knowledge for morphology and grapheme-to­phoneme modules in the multilingual text-to-speech processing system. The starting sizes of phonetic lexicon and morphology lexicon were 12.52 MB and 18.49 MB. Both lexicons contained 300.000 items. Final size of corresponding finite-state transducer was 2.78 MB (120.386 states) for the first one and 6.33 MB (183.123 states) for the second. In the first step of compilation, the great reduction of the number of states was achieved, what is evident from Fig. 8. This is also the reason, why we follow the procedure described under subsection 2.2 (Fig. 2). The number of 164 Informatica 28 (2004) 159–165 M. Rojc et al. states decreased already after determinization algorithm NUMBER OF OUTPUT CODES as expected. The number of states obviously does not 120 change after performing prefix computation algorithm. This algorithm works only on the output automaton of 100 the corresponding transducer. It pushes back outputs as 80 much as possible toward the initial state. The effect of 60 prefix computation algorithm can be noticed only at the end of the compilation process, when much smaller fi­nite-state transducers are obtained than in the case when only classical minimization algorithm after determiniza­tion would be performed. The answer for that can be FINITE-STATE TRANSDUCERS found from the Fig. 5 (transducers T2 and T3). In the transducer T3 we have after performing prefix computa­tion algorithm newly created ./. transition labels. This empty transition labels are result of pushing back outputs toward the initial state. That’s why after performing minimization algorithm in the sense of automata much smaller transducers are obtained. b) Figure 9: Increasing number of output codes in the first step of compilation process – 10 randomly chosen trans­ducers (a: phonetic lexicon. b: morphology lexicon.) NUMBER OF In the Fig.9 we see that the number of output codes has increased after determinization and prefix computation algorithm were performed. In case that the prefix com­putation would not be performed, final number of codes would be significantly smaller, but the final transducer would also be much bigger (more states and transitions). According to the experiments it only makes sense to have more codes and much smaller transducer. S4 FINITE-STATE TRANSDUCERS a) It is also interesting that in compilation of morphology lexicon, much more output codes is generated as in the NUMBER OF STATES case of phonetic lexicon (Fig. 9). The reason is that the morphology lexicon comprises much more information than the phonetic lexicon. In the second step of the compilation process, the same situation regarding the state reduction can be observed as in the first step (Fig. 10). Only reduction of the number of states is smaller and there is no significant increase of the number of output codes (Fig. 11). In Table 5 the final results for the obtained finite-state transducers for German phonetic and morphology lexicons are given. The number of input codes is the same for both lexicons and the number of output codes is two-times bigger in case of morphology lexicon. The reason is that the information field in the morphology lexicon is substantial longer (Fig. 1). FST1 FST2 Number of input codes 61 Number of output codes 34.879 87.204 Size of output vocabulary 343 KB 3.6 MB Number of states 112.498 169.613 Number of transitions 200.801 325.839 Size of bin file 2.78 MB 6.33 MB Table 5: The final finite-state transducers representing German phonetic (FST1) and FINITE-STATE TRANSDUCERS German morphology lexicon (FST2). a) NUMBER OF OUTPUT CODES a) NUMBER OF OUTPUT CODES b) Figure 11: Increasing number of output codes in the second step of compilation process – 10 randomly cho­sen transducers (a: phonetic lexicon. b: morphology lexicon.) 5 Conclusion Performing determinization of finite-state transducer obviously results in significant decrease in number of states. One disadvantage of the determinization algo­rithm is that the outputs are pushed toward final states that create a long delay in emission. But using prefix calculation algorithm before classical minimization al­gorithms for automata, the problem can be efficiently resolved. An important characteristic of this algorithm is that it pushes back outputs as much as possible toward the initial state. As expected, the number of states re­mains unchanged after performing prefix calculation algorithm. The efficiency of this algorithm can be seen only after performing classical minimization algorithm, when much smaller number of states is obtained than in case if only determinization and minimization algo­rithms would be performed. From table 5 it can be seen that finite-state transducers can efficiently represent large lexicons. They provide fast look-up time, double side look-up, and compactness. 6 References [1] Bruce William Watson, Taxonomies and Toolkits of Regular Language Algorithms, PhD Thesis, Eindho­ven University of Technology and Computing Sci­ence, 1995. [2] Watson, B.W., A taxonomy of finite automata mini­mization algorithms, Computing Science Report 93/44, Eindhoven University of Technology, The Nederlands, 1993. Informatica 28 (2004) 159–165 165 [3] Mehryar Mohri, On Some Applications of Finite-State Automata Theory to Natural Language Proce­ssing, Natural Language Engineering 1, Cambridge University Press, 1996. [4] Mehryar Mohri, Minimization Algorithms for Se­quential Transducers, Theoretical Computer Sci­ence, 234:177-201, March 2000. [5] Guenthner, F.&P. Maier, Das CISLEX Woerterbuch system, CIS-Bericht-94-76. [6] Aho, Alfred V., John E. Hopcroft, and Jeffrey D. Ullman, The design and analysis of computer algo­rithms. Addison Wesley: Reading, MA 1974. [7] Bauer, W, On minimizing finite automata, EATCS Bulletin, 35 1988. [8] Berstel, Jean and Cristophe Reutenauer, Rational Series and Their Languages, Springer-Verlag: Ber-lin-New York 1988. [9] Crochemore, Maxime, Transducers and repetitions, Theoretical Computer Science, 45 1986. [10] Hopcroft, John E. and Jeffrey D. Ullman, Intoduc­tion to Automata Theory, Languages, and Computa­tion. Addison Wesley: Reading MA 1979. [11] Kuich, Werner and Arto Salomaa, Semirings, Auto­mata, Languages, Number 5 in EATCS Monographs on Theoretical Computer Science. Springer Verlag, Berlin, Germany 1986. [12] Mehryar Mohri, Language Processing with Weighted Transducers, In Proceedings of the 8th annual Traitement Automatique des Langues Naturelles (TALN 2001). Tours, France, July 2001.