https://doi.org/10.31449/inf.v43i4.2520 Informatica 43 (2019) 467–476

String Transformation Based Morphology Learning

László Kovács
Institute of Information Technology, University of Miskolc, Miskolc-Egyetemváros, H 3515, Hungary
E-mail: kovacs@iit.uni-miskolc.hu and https://www.iit.uni-miskolc.hu

Gábor Szabó
Institute of Information Technology, University of Miskolc, Miskolc-Egyetemváros, H 3515, Hungary
E-mail: szgabsz91@gmail.com and https://www.iit.uni-miskolc.hu

Keywords: machine learning, natural language processing, inflection rule induction, agglutination, dictionaries, finite state transducers, tree of aligned suffix rules, lattice algorithms, string transformations

Received: October 10, 2018

There are several morphological methods that can solve the morphological rule induction problem. For different languages this task represents different difficulty levels. In this paper we propose a novel method that can learn prefix, infix and suffix transformations alike. The test language is Hungarian (a morphologically complex Uralic language containing a high number of affix types and complex inflection rules), and we chose a previously generated word pair set of the accusative case for evaluating the method, comparing its training time, memory requirements, average inflection time and correctness ratio with some of the most popular models: dictionaries, finite state transducers, the tree of aligned suffix rules and a lattice based method. We also provide multiple training and searching strategies, introducing parallelism and the concept of prefix trees to optimize the number of rules that need to be processed for each input word. The newly created method can be applied not only to morphology, but also to any problem in the fields of bioinformatics and data mining that can benefit from learning string transformations.

Povzetek: Predstavljena je nova metoda za morfološko učenje na primeru madžarščine. (A novel method for morphological learning is presented, demonstrated on Hungarian.)

1 Introduction

In the area of natural language processing (NLP), word structure is essential information for higher layer analysis such as syntax, part-of-speech tagging, named entity detection, sentiment and opinion analysis, and so on. The main difference between syntax and morphology is that while syntax works on the level of sentences, treating individual words as atoms, morphology works with intraword components.

According to morphology models, words are built up from morphemes, the smallest morphological units that encode semantic information. There are two types of morphemes: the lemma is the root, grammatically correct form of a word that is associated with the base meaning, while affixes are usually shorter character strings that slightly modify the meaning of the words. These affixes are language dependent, and can be prepended (as in in-correct), appended (as in fly-ing) or simply inserted into the words. Prepended affixes are called prefixes, appended affixes are called suffixes, while affixes inserted inside the words are called infixes. The latter category is rare in most languages; one example is the Latin verb vincō, where the n denotes the present tense. The addition of affixes is called inflection, while the inverse operation is called lemmatization.

Languages can be categorized into six main groups based on their morphological features [1]. Analytic languages such as English have a fixed set of possible affixes for each part-of-speech category.
Isolating languages like Chinese and Vietnamese usually have words that are their own stems, without any affixes. Languages that have only a few affix types usually use auxiliary words and word position to encode grammatical information. In introflective languages (Arabic, Hebrew), consonants express the meaning of words, while vowels add the grammatical meaning. Synthetic languages have three subcategories: polysynthetic languages like the Native American languages contain complicated words that are equivalent to sentences of other languages; in fusional languages such as Russian, Polish, Slovak and Czech, the morphemes are not easily distinguishable and often multiple grammatical relations are fused into one affix; agglutinative languages like Hungarian, Finnish and Turkish have many affix types and each word can contain a large number of affixes.

For different languages there are different models that can be used to learn morphological rules, as morphology is a language dependent area. Creating such models is a complex task, especially for agglutinative languages. In the literature we can find approaches that are based on suffix trees and error-driven learning [2] to optimally store transformation rules and search among them.

Hajic [3] proposed a generalized grammar model suitable for both synthetic and agglutinative languages. The author introduces a controlled rewriting system $CRS\langle A, V, K, t, R \rangle$, where $A$ is the alphabet, $V$ is the set of variables, $K$ contains the grammatical meanings (morphological categories), $t$ maps the variables to types and $R$ is a set of atomic rewrite rules. The substitution operation defined in the rewrite rules replaces all variables with some string; all instances of the same variable are replaced by the same string. The main parameters of an elementary substitution rule include the input state id, the output state id, the variable id, the morphological category and the resulting string. The article provides a formal framework to describe the transformation process, but it does not detail the rule generation process, since the model assumes that the rule set is constructed by human experts.

In the two-level morphology model [4], the inflected words are represented on two levels. The outer or surface level contains the written form of the words, while the inner or lexical level contains the morphological structures. For example, the surface level word "tries" is related to the lexical level "try+s". The lexical level represents the morphological categories and separator symbols for the surface form. The model uses a dictionary to store the valid lemmas and morpheme categories. The transformation between the lexical level and the surface level is implemented with a set of finite state transducers. A transducer is a special automaton that can model string transformations.

FSTs (finite state transducers) are widely used to manage morphological analysis for both generation and recognition processes. One of the main issues related to this model is the computational complexity of the implementations. It was shown that it is inefficient to work with complex morphological constraints [5], where there are complex dependencies among the different morpheme units, like vowel harmony. The analysis shows that both recognition and generation are NP-hard problems. One of the most widely known approaches to construct an FST is the OSTIA method [6, 7].
It first generates a prefix tree transducer, then merges all the possible states, pushes some output elements toward the initial state and eliminates all the non-deterministic elements. The OSTIA algorithm was later improved by Gildea and Jurafsky [8], who extended the algorithm with a better similarity alignment component.

Theron and Cloete [9] proposed a more general method based on edit-distance similarities of the base and inflected words. The algorithm learns the two-level transformation rules, calculating the string edit difference between each source-target pair and representing the edit sequences as a minimal acyclic finite state automaton. The constructed automaton can segment the target word into its constituent morphemes. The algorithm determines the minimal discerning context for each rule. This processing phase is done by comparing all the possible contiguous contexts to determine the shortest context.

Regarding current achievements, one important approach is presented in [10] and [11]. In the proposal of Goldsmith, a simplified morphology model is used, containing substitution of suffixes. The words are decomposed into sets of short substrings, where the substrings have a role similar to the morphemes. The proposed method uses the concept of minimal description length to determine the appropriate word segmentations.

Another popular and simple method is the so-called tree of aligned suffix rules (TASR) [12], which is a great match for morphological rule induction: according to previous evaluations it can be built very quickly and searched very quickly as well, providing an outstanding correctness ratio. Unlike dictionary based systems and FSTs, the TASR method can inflect even previously unseen words correctly. The only downside of this model is that it can only handle inflection rules that modify the end of the input word. In Hungarian we must be able to describe not only suffix rules, but also prefix and infix rules.

Besides trees, there are existing models that use lattice structures to store transformation rules. The goal of [13] is to optimize the lattice size by dropping rules that have a small impact on the overall results. The rule model uses concepts similar to the Levenshtein model, like additions, removals and replacements. The paper shows that this lattice based model has a very promising memory constraint, fast inflection time and a correctness ratio of almost 100%.

In this paper we present a novel model called the Atomic String Transformation Rule Assembler (ASTRA), whose base concept is similar to that of TASR, but which can handle all types of affixes: prefixes, infixes and suffixes alike. Our test language is Hungarian, a morphologically complex, highly agglutinative language that is frequently targeted by morphological model researchers due to its complexity. In Hungarian there is a high number of affix types that can form long affix type chains; moreover, each affix type can modify the base form significantly, using vowel gradation and changing consonant lengths. The inflection rules of the language are complex, and there are several exceptions, too. Besides morphological rule induction, our model is capable of dealing with any other string transformation based problems as well. Such problems can be found in the area of bioinformatics (e.g. investigating DNA sequences) and data mining (e.g. preprocessing of data, including spelling correction and data cleaning).
The structure of this paper is the following:

– Section 2 introduces the reference methods: dictionary based systems, finite state transducers, the tree of aligned suffix rules and the lattice based method.

– Section 3 describes the novel ASTRA method: its rule model, training phase and inflection phase. We also introduce three search algorithms to speed up inflection.

– The evaluation of the proposed method can be seen in section 4. The four metrics we measure and compare with the base methods are the training time, average inflection time, size and correctness ratio.

– In section 5 we present a general application of the ASTRA model.

2 Background

2.1 Dictionary Based Models

One of the most basic methods for learning inflection rules is using dictionaries. For morphological usage, a dictionary can be considered as a relation $D \subseteq W \times W$: for each input word it can return an output word. Usually dictionaries contain not only the inflected forms of words, but also other semantic information such as their meaning, part-of-speech tag, sample sentences and so on.

There are many language dependent WordNet projects [14, 15] whose goal is to build such databases. Besides automatic data mining techniques, these databases are often validated and corrected by human experts. Because of the large magnitude of data (the Hungarian WordNet contains more than 40,000 synsets, i.e. word sets with the same meaning), dictionaries can take much time to build. Their advantage is that irregular morphological forms are guaranteed to be retained; they are not dropped by generalization techniques. Besides the training time, the downside of dictionaries is the lack of generalization: other automated methods can usually handle previously unseen words, too, but dictionaries can only inflect and lemmatize words they know.

2.2 Finite State Transducers

The finite state automaton (FSA) is the base model for finite state transducers. An FSA is an $A = \langle Q, \Sigma, q_0, E, F \rangle$, where $Q$ is the finite set of states, $\Sigma$ is the input alphabet, $q_0$ is the start state, $E: Q \times \Sigma \to Q$ is the state transition relation and $F$ is the set of accepting states.

Finite state transducers (FST) [7] extend this model with additional components, as well as with the ability to output strings. There are multiple transducer models. A rational transducer is a $T = \langle Q, \Sigma, \Gamma, q_0, E \rangle$, where $Q$, $\Sigma$ and $q_0$ are the same as for an FSA, $\Gamma$ is the output alphabet and $E \subseteq (Q \times \Sigma^* \times \Gamma^* \times Q)$ is the state transition relation. In practice, $\Sigma = \Gamma$. A sequential transducer is almost the same, except for two additional conditions: $E \subseteq (Q \times \Sigma \times \Gamma^* \times Q)$ and $\forall (q, a, u, q'), (q, a, v, q'') \in E \Rightarrow u = v$ and $q' = q''$. A subsequential transducer is a special sequential transducer that has a sixth component, the state output function $\sigma: Q \to \Gamma^*$. Such a transducer works in the following way: each input character causes a state transition, and the label of this transition is appended to the output string. Finally, the ending state's output is also appended, resulting in the final output string. A transducer is onward if for every state, the state's output and the outputs of the state transitions starting from this state have no common prefix.

FSTs are used extensively for string transformations, because they have optimal sizes and can produce the output in almost constant time. However, as we will see, their generalization ability is not really usable in morphological applications.
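To illustrate the subsequential transducer definition above, the following minimal Python sketch runs such a machine on an input word; the toy transition table, which maps try to tried, is a made-up example rather than a transducer produced by an actual morphology tool.

    # A minimal sketch of a subsequential transducer.
    class SubsequentialTransducer:
        def __init__(self, start, transitions, state_output):
            self.start = start
            # transitions: (state, input char) -> (next state, output string)
            self.transitions = transitions
            # state_output: the sigma function, state -> final output string
            self.state_output = state_output

        def transduce(self, word):
            state, output = self.start, []
            for char in word:
                state, emitted = self.transitions[(state, char)]
                output.append(emitted)
            # append the ending state's output to finish the result
            return "".join(output) + self.state_output.get(state, "")

    past_tense = SubsequentialTransducer(
        start=0,
        transitions={(0, "t"): (1, "t"), (1, "r"): (2, "r"), (2, "y"): (3, "")},
        state_output={3: "ied"},  # replaces the consumed "y" at the word end
    )
    print(past_tense.transduce("try"))  # -> "tried"

A real morphological FST would of course be constructed automatically, e.g. by OSTIA-style state merging, rather than written by hand as above.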
2.3 Tree of Aligned Suffix Rules

There are three main types of substrings that can change in a word during inflection: prefixes, suffixes and infixes. The substring $pre \in \Sigma^*$, $|pre| > 0$ is a prefix of the string $s_1 \in \Sigma^*$ if there exists another string $s_2 \in \Sigma^*$ such that $s_1 = pre + s_2$. Similarly, the substring $suff \in \Sigma^*$, $|suff| > 0$ is a suffix of the string $s_1$ if there exists another string $s_2$ such that $s_1 = s_2 + suff$. The substring $inf \in \Sigma^*$, $|inf| > 0$ is an infix of the string $s_1$ if there exist two other strings $s_2, s_3$ such that $s_1 = s_2 + inf + s_3$, where $|s_2| > 0$ and $|s_3| > 0$.

The TASR model can only work with morphological rules that modify the end of the words, meaning that it can only model suffix transformations. This restriction is acceptable for morphologically simpler languages, but complex agglutinative languages often contain prefix and infix transformation rules as well.

The goal of the TASR learning phase is to generate a set of suffix rules from a training word pair set. This set of rules is denoted by $\mathcal{R}_T = \{R_T\}$ in this paper. A suffix rule consists of two components: $R_T = (\alpha_T, \beta_T)$, where $\alpha_T, \beta_T \in \Sigma^*$. Here, $\alpha_T$ contains the word-ending characters that are modified by the rule, and $\beta_T$ contains the replacement characters. As an example, for the English verb try whose past tense is tried, we can generate a suffix rule where $\alpha_T = y$ and $\beta_T = ied$.

The rule $R_{T_1} = (\alpha_{T_1}, \beta_{T_1})$ is aligned with the rule $R_{T_2} = (\alpha_{T_2}, \beta_{T_2})$, or shortly $R_{T_1} \parallel R_{T_2}$, if $\forall s_1 \in \Sigma^*\ \exists s_2 \in \Sigma^*$ such that $s_1 + \alpha_{T_1} = s_2 + \alpha_{T_2}$ and $s_1 + \beta_{T_1} = s_2 + \beta_{T_2}$. The aligned-with operator is symmetric, so $R_{T_1} \parallel R_{T_2} \iff R_{T_2} \parallel R_{T_1}$.

If we have a word pair, for example (try, tried), we can generate multiple aligned suffix rules. The minimal suffix rule is (y, ied), and after extending this rule with one character at a time, we get (ry, ried) and (try, tried). We can define a frequency metric $freq(R_T \mid I)$ for each rule $R_T$ based on the training word pair set $I = \{(w_1, w_2) \mid w_1, w_2 \in W\}$, counting the number of word pairs for which $R_T$ applies.
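As an illustration of these definitions, the following sketch generates the minimal suffix rule of a word pair together with its aligned extensions, and counts rule frequencies over a toy training set; the training list is a made-up example.

    # A sketch of aligned suffix rule generation for the TASR model:
    # the minimal rule (y, ied) of (try, tried) is extended one
    # character at a time toward the start of the word.
    from collections import Counter
    from os.path import commonprefix

    def aligned_suffix_rules(base, inflected):
        """Return the minimal suffix rule and all of its aligned extensions."""
        common = len(commonprefix([base, inflected]))
        rules = []
        # k = common gives the minimal rule; smaller k extends it leftwards
        for k in range(common, -1, -1):
            rules.append((base[k:], inflected[k:]))
        return rules

    training = [("try", "tried"), ("cry", "cried"), ("play", "played")]
    freq = Counter(rule for w1, w2 in training
                   for rule in aligned_suffix_rules(w1, w2))
    print(aligned_suffix_rules("try", "tried"))
    # -> [('y', 'ied'), ('ry', 'ried'), ('try', 'tried')]
    print(freq[("y", "ied")])  # shared by (try, tried) and (cry, cried) -> 2

Note that for pure suffix additions such as (play, played), the minimal rule degenerates to an empty context in this sketch; it is an illustration of the rule shape, not the authors' implementation.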
For every word pair in the training set, we must first generate all the aligned suffix rules according to the above definitions and insert these rules into a tree $(T, \preceq)$. This tree consists of nodes $n_{T_1}, n_{T_2}, \ldots, n_{T_m}$, each node $n_{T_i}$ associated with a set of rules $n_{T_i} \mapsto \{R_{T_{ij}} = (\alpha_{T_{ij}}, \beta_{T_{ij}})\}$. All the rules associated with the same node have the same context.

Let us have two nodes, $n_{T\downarrow}$ and $n_{T\uparrow}$, associated with the rules $R_{T\downarrow i} = (\alpha_{T\downarrow i}, \beta_{T\downarrow i})$ and $R_{T\uparrow j} = (\alpha_{T\uparrow j}, \beta_{T\uparrow j})$, respectively. The node $n_{T\downarrow}$ is the child of $n_{T\uparrow}$, or shortly $n_{T\downarrow} \prec n_{T\uparrow}$, if $\exists x \in \Sigma: \forall i, j: \alpha_{T\downarrow i} = x + \alpha_{T\uparrow j}$. The root node and its rules are denoted by $n_{T*} \mapsto \{R_{T*k} = (\alpha_{T*k}, \beta_{T*k})\}$. For the root, the following condition applies: $\forall k: |\alpha_{T*k}| = \min_{ij} |\alpha_{T_{ij}}|$.

The child rule $R_{T\downarrow} = (\alpha_{T\downarrow}, \beta_{T\downarrow})$ is subsumed by the parent rule $R_{T\uparrow} = (\alpha_{T\uparrow}, \beta_{T\uparrow})$, written $R_{T\downarrow} < R_{T\uparrow}$, if $\alpha_{T\downarrow} = x + \alpha_{T\uparrow}$ and $\beta_{T\downarrow} = x + \beta_{T\uparrow}$, where $x \in \Sigma$.

After these definitions, we can define which rule is the winning rule of a node $n_{T\downarrow}$ among its associated rules $R_{T\downarrow i} = (\alpha_{T\downarrow i}, \beta_{T\downarrow i})$. Let $n_{T\uparrow}$ be the parent node with rules $R_{T\uparrow j} = (\alpha_{T\uparrow j}, \beta_{T\uparrow j})$. The winner rule is $\hat{R}_{T\downarrow} = R_{T\downarrow k}$ such that $freq(R_{T\downarrow k} \mid I) = \max_i freq(R_{T\downarrow i} \mid I)$ and $\nexists j: R_{T\uparrow j} > R_{T\downarrow k}$.

After that, we can build the tree from the generated rules. Typically the most general rules will be close to the root node, while the most specific rules will be stored in the leaves. Therefore, during inflection we can search the tree in a bottom-up fashion, returning the winner rule of the first node we find whose context matches the input word. Since we start at the leaves, the first matching rule will be the most specific one, having the longest context. This means that the resulting inflected form will mirror the main characteristics of the training data.

2.4 Lattice Based Method

The rule model of the examined lattice based inflection method [13] is a six-tuple $R = (\pi, \gamma, \omega, \overrightarrow{\iota}, \overleftarrow{\iota}, \langle \tau_i \rangle)$, where

– $\pi \in \Sigma^*$ is the prefix of the rule, containing the characters before the changing part,

– $\gamma \in \Sigma^*$ is the core of the rule, i.e. the changing part,

– $\omega \in \Sigma^*$ is the postfix of the rule, containing the characters after the changing part,

– $\overrightarrow{\iota} \in \mathbb{N}$ is the front index of the rule's context occurrence in the source word,

– $\overleftarrow{\iota} \in \mathbb{N}$ is the back index of the rule's context occurrence in the source word and

– $\langle \tau_i \rangle$ is a list of simple transformation steps on the core.

These rules are generated automatically from training word pairs, then inserted into a lattice structure, where the parent-child relationship is based on rule context containment. In the original paper we formalized multiple lattice builder algorithms that tried to reduce the size of the resulting lattice. The best builder only inserts those rules and intersections into the lattice that are really responsible for the high correctness ratio; every other redundant rule is eliminated.

As we will see, the size characteristics of this model are very promising, but because of the high degree of generalization, the lattice can inflect some words incorrectly. This is due to the overgeneralization effect of the lattice model itself.

3 Atomic string transformation rule assembler

The goal of the Atomic String Transformation Rule Assembler (ASTRA) model is to collect atomic, elementary patterns from a training word pair set during the training phase, and to use the best matching atomic rules for each input word during the production phase. For each input, every matching, non-overlapping atomic rule is applied to produce the correct inflected form. As discussed previously, using these concepts the proposed method can model prefix, infix and suffix inflection rules alike, and thus can be used for morphologically complex agglutinative languages.

First of all, we define an extended alphabet so that it is easier to determine where a word starts and ends. Let us introduce two special characters: $ will mark the start of the word and # will mark the end of the word. If a rule's context contains any of these two special characters, it is easy to determine whether the beginning or the end of the word needs to be transformed. Of course, these characters are not part of the original alphabet. The extended alphabet is denoted by $\overline{\Sigma} = \Sigma \cup \{\$, \#\}$. We also define a new operator on strings that prepends $ and appends # to the string: $\psi(w) = \overline{w} = \$ + w + \#$. The inverse operation drops the special characters from the input word: $\psi^{-1}(\overline{w}) = w$. The set of extended words is denoted by $\overline{W}$.

The input of the training process for the new method is the same set of word pairs containing the base form and inflected form of the words, but the first step of the algorithm is to extend these word pairs with the new special characters. After the extension, we get a new training set $\overline{I} = \{(\overline{w}_1, \overline{w}_2)\}$. We split each word pair into matching segments:

$$\overline{w}_1 = \sigma_1^1 \sigma_2^1 \ldots \sigma_k^1, \qquad \overline{w}_2 = \sigma_1^2 \sigma_2^2 \ldots \sigma_k^2.$$

A segment $\sigma_i^1 \to \sigma_i^2$ is called variant if $\sigma_i^1 \neq \sigma_i^2$, otherwise it is called invariant.
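To make the extension operator and the segmentation step concrete, the following sketch computes a simplified decomposition built around the longest common substring. The paper allows several alternating segments and selects among candidate decompositions with the fitness formula defined below, so treat this single-invariant version as an illustration only; it happens to reproduce the decomposition of example 3.1.

    # A sketch of the psi operator and a simplified segment decomposition.
    from difflib import SequenceMatcher

    def psi(word):
        """Extend a word with the start ($) and end (#) markers."""
        return "$" + word + "#"

    def decompose(w1, w2):
        """Split an extended word pair into (variant, invariant, variant)
        segments around the longest common substring."""
        m = SequenceMatcher(None, w1, w2, autojunk=False).find_longest_match(
            0, len(w1), 0, len(w2))
        return [(w1[:m.a], w2[:m.b]),                          # variant (may be empty)
                (w1[m.a:m.a + m.size], w2[m.b:m.b + m.size]),  # invariant
                (w1[m.a + m.size:], w2[m.b + m.size:])]        # variant (may be empty)

    print(decompose(psi("dob"), psi("ledobott")))
    # -> [('$', '$le'), ('dob', 'dob'), ('#', 'ott#')]

Word pairs with several separated change regions would require applying the same split recursively to the variant parts, which is omitted here for brevity.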
In a segment decomposition, variant and invariant segments alternate. As one word pair might have multiple segment decompositions, we need to select the best one among them. To quantify the goodness of a decomposition, we use a segment fitness formula that returns how well aligned the segment $\sigma_i^1 \to \sigma_i^2$ is:

$$\frac{1}{1 + \mathit{index}_{max} - \mathit{index}_{min}} + 2 \cdot |\sigma_i^2|,$$

where $\mathit{index}_{max}$ and $\mathit{index}_{min}$ are the maximal and minimal indices of the $i$th segment, i.e. the maximum and minimum of the indices $\sum_{j=1}^{i-1} |\sigma_j^1|$ and $\sum_{j=1}^{i-1} |\sigma_j^2|$, respectively. This formula encodes that invariant segments are better if their components are longer and the two components appear near each other.

Example 3.1. Let us choose the training word pair (dob, ledobott)¹ to demonstrate the segment decomposition algorithm. First, the words are extended with the special characters: ($dob#, $ledobott#). One valid segment decomposition is the following: $(\sigma_1^1 = \$,\ \sigma_1^2 = \$le)$, $(\sigma_2^1 = dob,\ \sigma_2^2 = dob)$, $(\sigma_3^1 = \#,\ \sigma_3^2 = ott\#)$. The middle segment is invariant, while the first and last ones are variant segments.

¹ Hungarian for (throw, threw down). Note that two affixes are added: one for the past tense and one preverb for down.

For each variant segment, we can define so-called atomic rules of the form $R_A = (\alpha_A, \beta_A, \gamma_A, \omega_A)$, where $\alpha_A$ is the prefix and $\omega_A$ is the suffix. The rule context that must be searched in the input words later is $\kappa(R_A) = \alpha_A + \beta_A + \omega_A$. We can see that with this rule model not only suffix rules can be modelled, thanks to the new $\alpha_A$ and $\omega_A$ components.

Let us take a variant segment $\sigma_i^1 \to \sigma_i^2$. First, we need to define the core atomic rule $R_{A_{ic}} = (\alpha_{A_{ic}}, \beta_{A_{ic}}, \gamma_{A_{ic}}, \omega_{A_{ic}})$ for this segment, which has no prefix or postfix: $|\alpha_{A_{ic}}| = 0$, $\beta_{A_{ic}} = \sigma_i^1$, $\gamma_{A_{ic}} = \sigma_i^2$ and $|\omega_{A_{ic}}| = 0$.

Then we can extend this core atomic rule with one character at a time on the left and right sides, symmetrically. Let us assume that $\sum_{j=1}^{i-1} |\sigma_j^1| = n$, $\sum_{j=i+1}^{k} |\sigma_j^1| = m$ and $|\sigma_i^1| = l$. In this case, the extended rule candidates are $R_{A_{ij}} = (\alpha_{A_{ij}}, \beta_{A_{ij}}, \gamma_{A_{ij}}, \omega_{A_{ij}})$ with the following components ($\forall\, 1 \le j \le \min\{n, m\}$):

$$\alpha_{A_{ij}} = \overline{w}_1[n+1-j;\ n], \quad \beta_{A_{ij}} = \sigma_i^1, \quad \gamma_{A_{ij}} = \sigma_i^2, \quad \omega_{A_{ij}} = \overline{w}_1[n+l+1;\ n+l+j].$$

Here, $w[i; j]$ denotes the substring of $w$ from the $i$th to the $j$th character. To make the generated atomic rules unambiguous, we have to make sure that the context of a rule appears only once in the base form of the word ($\overline{w}_1$). Every atomic rule candidate whose context appears more than once in the base form of the word is dropped from the final set.

Example 3.2. Using the winning segmentation of example 3.1, the following atomic rules can be generated from the word pair (dob, ledobott): (ε, $, $le, ε), (ε, $, $le, d), (ε, $, $le, do), (ε, $, $le, dob), (ε, $, $le, dob#), (ε, #, ott#, ε), (b, #, ott#, ε), (ob, #, ott#, ε), (dob, #, ott#, ε) and ($dob, #, ott#, ε), where ε denotes the empty string.

Transforming a word $\overline{w} \in \overline{W}$ using the atomic rule $R_A = (\alpha_A, \beta_A, \gamma_A, \omega_A)$ can be defined as

$$\tau(R_A, \overline{w}) = \begin{cases} \overline{w} & \text{if } \kappa(R_A) \not\sqsubseteq \overline{w}, \\ \overline{w} \,\backslash\, \kappa(R_A)[\beta_A \to \gamma_A] & \text{otherwise,} \end{cases}$$

where $\overline{w} \,\backslash\, \kappa(R_A)[\beta_A \to \gamma_A]$ means that we search for $\kappa(R_A)$ in $\overline{w}$ and replace $\beta_A$ with $\gamma_A$.
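The sketch below generates the atomic rules of a decomposition; the $(\alpha, \beta, \gamma, \omega)$ tuple layout follows the reconstruction used in this section, and the extension loop clips the left and right contexts at the word boundaries instead of stopping at the $\min\{n, m\}$ bound, since that is what reproduces all ten rules of example 3.2.

    # A sketch of atomic rule generation: one core rule per variant
    # segment, extended character by character on both sides.
    def atomic_rules(w1, segments):
        rules, n = [], 0  # n: characters of w1 consumed so far
        for source, target in segments:
            l = len(source)
            if source != target:  # variant segment
                for j in range(max(n, len(w1) - n - l) + 1):  # j = 0: core rule
                    rule = (w1[max(0, n - j):n],   # alpha: left context
                            source, target,        # beta -> gamma replacement
                            w1[n + l:n + l + j])   # omega: right context
                    context = rule[0] + rule[1] + rule[3]
                    # keep the rule only if its context is unambiguous in w1
                    if w1.count(context) == 1 and rule not in rules:
                        rules.append(rule)
            n += l
        return rules

    decomposition = [("$", "$le"), ("dob", "dob"), ("#", "ott#")]
    for rule in atomic_rules("$dob#", decomposition):
        print(rule)
    # ('', '$', '$le', ''), ('', '$', '$le', 'd'), ... ('$dob', '#', 'ott#', '')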
The base form of the method does not require building a tree; we can simply group the atomic rules based on their contexts. A rule group is defined as a set of atomic rules $G_A = \{R_{A_i} = (\alpha_{A_i}, \beta_{A_i}, \gamma_{A_i}, \omega_{A_i})\}$ such that $\forall R_{A_i}, R_{A_j} \in G_A: \kappa(R_{A_i}) = \kappa(R_{A_j})$. The context of the rule group is $\kappa(G_A) = \kappa(R_A)$ for every $R_A \in G_A$.

Example 3.3. For the atomic rules of example 3.2, we can produce nine different rule groups, each containing a single atomic rule, except for the rule group with context $dob# that contains both (ε, $, $le, dob#) and ($dob, #, ott#, ε).

The goal of the training phase is to produce a set of rule groups $\mathcal{R}_A = \{G_A\}$ based on the training word pair set $\overline{I}$. The generated atomic rule set can then be used to inflect the given input words. For each input, our goal is to choose some atomic rules that match the input word. Rules with longer matching substrings in the input word are better than rules with shorter matching substrings. The fitness function is

$$f(R_A \mid \overline{w}) = \left(\frac{|\kappa(R_A)|}{|\overline{w}|}\right)^k \cdot \delta(\kappa(R_A), \overline{w}),$$

where $k$ is a parameter and the $\delta$ function returns how similar the rule context is to the input word. To simplify things, we used $k = 1$ and a discrete $\delta$ function that returns 1 if $\kappa(R_A) \sqsubseteq \overline{w}$, and 0 otherwise.

Using this fitness function, we can choose the first $n$ atomic rules that are best suited for the given input word, where $n$ is a parameter. We implemented three separate candidate selector algorithms. The first one is a sequential algorithm that processes the rule groups one by one: if a rule group's context matches the input word, its atomic rules are added to the resulting set of candidate rules. The second one is a parallel algorithm that does the same thing in a divide and conquer manner, processing the rule groups in parallel; the number of threads depends on the number of CPU cores. The third one uses a prefix tree that is built from the rule groups during the training phase. With the prefix tree, we can speed up the candidate search by looking up substrings of the input word: if a substring is found in the prefix tree, the appropriate rule group's atomic rules are added to the resulting set.

Since there might be multiple overlapping rule candidates that would transform the same substring of the word, leading to ambiguity, only the first of these rules is used and the others are dropped. After choosing the best non-overlapping rules, we can apply them one by one on the input word, producing its inflected form.
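The following compact sketch combines the sequential search strategy with the rule application step, reusing the $(\alpha, \beta, \gamma, \omega)$ rule layout of the previous sketches; the rule groups below are a hand-picked subset of the rules of example 3.2, and the parallel and prefix tree variants described above are omitted.

    def inflect(word, rule_groups, n_best=5):
        # sequential candidate search: collect the rules of every group
        # whose context occurs in the word (the discrete delta function),
        # ranked by the length-proportional fitness
        candidates = []
        for context, rules in rule_groups.items():
            if context in word:
                fitness = len(context) / len(word)
                for alpha, beta, gamma, omega in rules:
                    start = word.find(context) + len(alpha)  # where beta starts
                    candidates.append((fitness, start, len(beta), gamma))
        candidates.sort(key=lambda c: -c[0])
        # keep the best non-overlapping replacements among the n_best rules
        chosen, used = [], set()
        for fitness, start, size, gamma in candidates[:n_best]:
            span = set(range(start, start + size))
            if not span & used:
                chosen.append((start, size, gamma))
                used |= span
        # apply the replacements from right to left so positions stay valid
        for start, size, gamma in sorted(chosen, reverse=True):
            word = word[:start] + gamma + word[start + size:]
        return word

    groups = {
        "$dob#": [("", "$", "$le", "dob#"), ("$dob", "#", "ott#", "")],
        "$d": [("", "$", "$le", "d")],
    }
    print(inflect("$dob#", groups)[1:-1])  # drop $ and # -> "ledobott"

Both rules of the $dob# group win here with fitness 1 and touch disjoint spans of the word, so both are applied; the shorter $d rule overlaps the first one and is dropped.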
4 Evaluation of the proposed method

For evaluation purposes, we used a training word pair set generated by [16]. We chose the Hungarian accusative case as our target affix type and used up to 10,000 training word pairs. We compared a custom dictionary implementation, Lucene's FST method, the TASR model, the previously mentioned lattice based method and the proposed ASTRA method, measuring their training times, their average inflection times, the sizes of their rule bases and their correctness ratios, i.e. what percentage of the evaluation words is inflected correctly after the training phase. If $W_+$ is the set of evaluation words for which the model yields a correct inflected form, and $W_-$ is the set of failed evaluation words, then the correctness ratio is $|W_+| / (|W_+| + |W_-|)$. Where applicable, we also measured the differences between the sequential, parallel and prefix tree search algorithms in case of ASTRA.

Figure 1: Training time (a) and average inflection time (b) of the methods, plotted against the number of training word pairs on a logarithmic time scale (series: Dict, FST, TASR, Lattice, ASTRA and ASTRA+Pref in (a); Dict, FST, TASR, Lattice, ASTRA+Seq, ASTRA+Par and ASTRA+Pref in (b)).

In Figure 1a we can see the training time of the methods, using a logarithmic scale for the y axis. As we can see, there are three different clusters based on the training time. The fastest solution is to store the already available set of word pairs in a dictionary, because we only have to store these records; no extra processing occurs. Building an FST is the next in line, but it has very similar characteristics to the ASTRA method. If we include the prefix tree building as well, ASTRA's training time increases a bit. The third cluster consists of the TASR and the lattice based methods. It can be seen that building a tree of aligned suffix rules takes more time than the previous methods, and the complexity of the lattice adds even more time to the TASR's results.

Figure 1b shows the average inflection time of the methods. As we can expect, if we use an appropriate hash function in the dictionary implementation, retrieving the matching record for each input word becomes almost constant in time. The second best method in terms of average inflection time is the FST: it also has a very flat curve, but it is a bit higher than the dictionary's. ASTRA with a prefix tree comes next, but it is very close to the line of the lattice based method. The remaining methods have much steeper curves: TASR comes next, with the parallel search function of ASTRA very close to it, while the worst inflection time is produced by the sequential search function. Note that although the inflection time of the prefix tree search variant is the best for ASTRA, it adds some overhead to the training time. However, even with this overhead, we can say that it is worth using.

Figure 2: Size of the rule bases, plotted against the number of training word pairs (series: Dict, FST, TASR, Lattice, ASTRA).

In Figure 2 we can see the overall size of the rule bases, i.e. the number of word pairs in the dictionary, states in the FST, nodes in the TASR and the lattice, and atomic rules in case of ASTRA.

It is not surprising that there are more generated atomic rules in ASTRA than nodes in the tree of aligned suffix rules, since the atomic rule definition allows multiple variant segments in a word pair, and from these variant segments multiple core and extended atomic rules can be produced. On the other hand, TASR will only generate one minimal suffix rule per word pair and all of its aligned extensions. The advantage of the ASTRA model is that even with this higher number of rules and the prefix tree, we can train it faster than a TASR. Moreover, it can cover more cases, including prefix, infix and suffix rules.

The built FST has better size characteristics, because its builder algorithm merges every state that can be merged without losing information from the original training word pair set. It can be seen from the line of the dictionary that the number of states in an FST and the number of rules in the ASTRA and TASR are higher than the number of input word pairs. However, the minimal lattice builder algorithm produces an even better lattice size, as the number of nodes in the resulting lattice is lower than the size of all the other structures.

Finally, Figures 3a and 3b show the correctness ratio of the models.
The results on the left side were achieved by using disjoint training and evaluation word pair sets. We can see that the correctness ratio plateaus a bit below 95% for TASR and ASTRA, the latter performing a bit better. It can also be seen that the lattice based method is worse, probably because of its higher degree of generalization. When we examined the results of the lattice compared to TASR and ASTRA, we saw that in multiple cases the lattice found a node whose rule resulted in an invalid inflected form. The correctness ratio of the dictionary and the FST is 0%, because they could not generalize at all. For the dictionary this is understandable, because a dictionary is a static map of word pairs. On the other hand, although an FST can generalize, these types of morphological applications do not benefit from this generalization, as the generalized transformations do not result in real inflection rules.

On the right side of the figure, we can see what happens if we use the first 100, 200, ..., 10,000 word pairs to train the methods, and then use the same 10,000 word pairs for evaluation. All the methods have an almost 100% correctness ratio at the end of the diagram. The only reason we cannot reach 100% is that the training word pair set contains records with the same lemma and different inflected forms, such as örömöt and örömet, two valid inflected forms of the Hungarian word öröm (joy in English). The difference resides in the characteristics of the curves. The dictionary and the FST cannot really generalize inflection rules, so their lines are linear. The other methods reach higher percentages more quickly, and as we can see, the ASTRA method is even better than the TASR in that it can produce a better correctness ratio with a smaller number of training word pairs. The lattice based method is worse than TASR and ASTRA in this case as well.

5 General application of the ASTRA model

One of the scientific areas applying string algorithms, including string transformation based methods, is bioinformatics and computational biology [17]. DNA sequences are modelled using strings of four characters matching the four types of bases: adenine (A), thymine (T), guanine (G) and cytosine (C). One of the goals of bioinformatics is to compare genes in DNA to find the regions that are important, find out which region is responsible for which functions and features, and determine how genetic information is encoded. The process of DNA analysis is a very computation intensive task, which is why modelling, statistical algorithms and mathematical techniques are important aspects of success.

Besides applying string transformations, computational biology uses many string matching and comparison techniques as well [18]. Finding the longest matching substrings of two strings (DNA sequences) helps in finding the best DNA alignments, and thus in comparing different DNA sequences and finding matching parts and differences. One of the techniques used for this comparison is the application of the edit distance computation originally published by Levenshtein [19] for morphological analysis.

Another application area where string transformation based methods are applied is data mining. Data mining engines usually consist of multiple phases to extract information out of unannotated training data such as long free texts.
The first phase is often called data cleaning, where the raw input data is preprocessed so that invalid records are either removed or fixed before moving on with the data mining algorithms. One way to fix the typos and other errors in free texts is spelling correction. Spelling correction can be interpreted as learning those string transformations that can transform an unknown word containing typos into the closest known word. There are multiple techniques to solve this problem; usually iterative algorithms perform better, as there can be multiple problems with a word that are easier to fix in multiple steps [20]. The goal is to find a word $w \in W$ for any unknown string $s$ so that their distance $d(w, s) < \varepsilon$ is lower than an acceptable threshold.

A third, more intuitive non-morphological application of the ASTRA model is character sorting. Let us have a random string $s \in \Sigma^*$ with a given length of $|s| = n$. The goal is to rearrange the characters in $s$ so that for each index $i$, $1 \le i < n$, $s[i] \le s[i+1]$.
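To make the spelling correction formulation above concrete, the following sketch implements the classic dynamic-programming edit distance of Levenshtein [19] and the $d(w, s) < \varepsilon$ nearest known word lookup; the word list and the threshold are made-up examples.

    def levenshtein(a, b):
        """Edit distance: the minimal number of insertions, deletions
        and substitutions transforming a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def correct(s, known_words, epsilon=3):
        """Return the known word closest to s, if within the threshold."""
        best = min(known_words, key=lambda w: levenshtein(w, s))
        return best if levenshtein(best, s) < epsilon else None

    print(correct("morphologie", ["morphology", "phonology", "syntax"]))
    # -> "morphology" (distance 2 < epsilon)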