https://doi.org/10.31449/inf.v43i4.2520 Informatica 43 (2019) 467–476

String Transformation Based Morphology Learning

László Kovács
Institute of Information Technology, University of Miskolc, Miskolc-Egyetemváros, H 3515, Hungary
E-mail: kovacs@iit.uni-miskolc.hu and https://www.iit.uni-miskolc.hu

Gábor Szabó
Institute of Information Technology, University of Miskolc, Miskolc-Egyetemváros, H 3515, Hungary
E-mail: szgabsz91@gmail.com and https://www.iit.uni-miskolc.hu

Keywords: machine learning, natural language processing, inflection rule induction, agglutination, dictionaries, finite state transducers, tree of aligned suffix rules, lattice algorithms, string transformations

Received: October 10, 2018

There are several morphological methods that can solve the morphological rule induction problem. For different languages this task represents different difficulty levels. In this paper we propose a novel method that can learn prefix, infix and suffix transformations alike. The test language is Hungarian (a morphologically complex Uralic language containing a high number of affix types and complex inflection rules), and we chose a previously generated word pair set of the accusative case for evaluating the method, comparing its training time, memory requirements, average inflection time and correctness ratio with some of the most popular models: dictionaries, finite state transducers, the tree of aligned suffix rules and a lattice based method. We also provide multiple training and searching strategies, introducing parallelism and the concept of prefix trees to optimize the number of rules that need to be processed for each input word. The newly created method can be applied not only to morphology, but also to any problem in the fields of bioinformatics and data mining that can benefit from learning string transformations.

Povzetek: Predstavljena je nova metoda za morfološko učenje na primeru madžarščine. (A novel method for morphological learning is presented, demonstrated on Hungarian.)

1 Introduction

In the area of natural language processing (NLP), word structure is essential information for higher layer analysis such as syntax, part-of-speech tagging, named entity detection, sentiment and opinion analysis, and so on. The main difference between syntax and morphology is that while syntax works on the level of sentences, treating individual words as atoms, morphology works with intraword components.

According to morphology models, words are built up from morphemes, the smallest morphological units that encode semantic information. There are two types of morphemes: the lemma is the root, grammatically correct form of a word that is associated with the base meaning, while affixes are usually shorter character strings that slightly modify the meaning of the words. These affixes are language dependent, and can be prepended (as in in-correct), appended (as in fly-ing) or simply inserted into the words. Prepended affixes are called prefixes, appended affixes are called suffixes, while affixes inserted inside the words are called infixes. The latter category is rare in most languages; one example is the Latin verb vincō, where the n denotes the present tense. The addition of affixes is called inflection, while the inverse operation is called lemmatization.

Languages can be categorized into six main groups based on their morphological features [1]. Analytic languages such as English have a fixed set of possible affixes for each part-of-speech category.
Isolating languages like Chinese and Vietnamese usually have words that are their own stems, without any affixes. Languages that have only a few affix types usually use auxiliary words and word position to encode grammatical information. In introflective languages (Arabic, Hebrew), consonants express the meaning of words, while vowels add the grammatical meaning. Synthetic languages have three subcategories: polysynthetic languages like the Native American languages contain complicated words that are equivalent to sentences of other languages; in fusional languages such as Russian, Polish, Slovak and Czech, the morphemes are not easily distinguishable and often multiple grammatical relations are fused into one affix; agglutinative languages like Hungarian, Finnish and Turkish have many affix types and each word can contain a large number of affixes.

For different languages there are different models that can be used to learn morphological rules, as morphology is a language dependent area. Creating such models is a complex task, especially for agglutinative languages. In the literature we can find approaches that are based on suffix trees and error-driven learning [2] to optimally store transformation rules and search among them.

Hajic [3] proposed a generalized grammar model suitable for both synthetic and agglutinative languages. The author introduces a controlled rewriting system $CRS\langle A, V, K, t, R \rangle$, where $A$ is the alphabet, $V$ is the set of variables, $K$ contains the grammatical meanings (morphological categories), $t$ maps the variables to types and $R$ is a set of atomic rewrite rules. The substitution operation defined in the rewrite rules replaces all variables with some string; all instances of the same variable are replaced by the same string. The main parameters of an elementary substitution rule include the input state id, the output state id, the variable id, the morphological category and the resulting string. The article provides a formal framework to describe the transformation process, but it does not detail the rule generation process, since the model assumes that the rule set is constructed by human experts.

In the two-level morphology model [4], the inflected words are represented on two levels. The outer or surface level contains the written form of the words, while the inner or lexical level contains the morphological structures. For example, the surface level word "tries" is related to the lexical level "try+s". The lexical level represents the morphological categories and separator symbols for the surface form. The model uses a dictionary to store the valid lemmas and morpheme categories. The transformation between the lexical level and the surface level is implemented with a set of finite state transducers. A transducer is a special automaton that can model string transformations.

FSTs (finite state transducers) are widely used to manage morphological analysis for both generation and recognition processes. One of the main issues related to this model is the computational complexity of the implementations. It was shown that it is inefficient to work with complex morphological constraints [5], where there are complex dependencies among the different morpheme units, like vowel harmony. The analysis shows that both recognition and generation are NP-hard problems. One of the most widely known approaches to construct an FST is the OSTIA method [6, 7].
It first generates a prefix tree transducer, then merges all the possible states, pushes some output elements toward the initial state and eliminates all the non-deterministic elements. The OSTIA algorithm was later improved by Gildea and Jurafsky [8], who extended the algorithm with a better similarity alignment component.

Theron and Cloete [9] proposed a more general method based on edit-distance similarities of the base and inflected words. The algorithm learns the two-level transformation rules, calculating the string edit difference between each source-target pair and representing the edit sequences as a minimal acyclic finite state automaton. The constructed automaton can segment the target word into its constituent morphemes. The algorithm determines the minimal discerning context for each rule. This processing phase is done by comparing all the possible contiguous contexts to determine the shortest context.

Regarding current achievements, one important approach is presented in [10] and [11]. In the proposal of Goldsmith, a simplified morphology model is used, containing substitution of suffixes. The words are decomposed into sets of short substrings, where the substrings have a role similar to the morphemes. The proposed method uses the concept of minimal description length to determine the appropriate word segmentations.

Another popular and simple method is the so-called tree of aligned suffix rules (TASR) [12], which is a great match for morphological rule induction: according to previous evaluations it can be built very quickly and searched very quickly as well, providing an outstanding correctness ratio. Unlike dictionary based systems and FSTs, the TASR method can inflect even previously unseen words correctly. The only downside of this model is that it can only handle inflection rules that modify the end of the input word. In Hungarian we must be able to describe not only suffix rules, but also prefix and infix rules.

Besides trees, there are existing models that use lattice structures to store transformation rules. The goal of [13] is to optimize the lattice size by dropping rules that have a small impact on the overall results. The rule model uses concepts similar to the Levenshtein model, like additions, removals and replacements. The paper shows that this lattice based model has a very promising memory constraint, fast inflection time and a correctness ratio of almost 100%.

In this paper we present a novel model called the Atomic String Transformation Rule Assembler (ASTRA), whose base concept is similar to that of TASR, but which can handle all types of affixes: prefixes, infixes and suffixes alike. Our test language is Hungarian, a morphologically complex, highly agglutinative language that is frequently targeted by morphological model researchers due to its complexity. In Hungarian there is a high number of affix types that can form long affix type chains; moreover, each affix type can modify the base form significantly, using vowel gradation and changing consonant lengths. The inflection rules of the language are complex, and there are several exceptions, too. Besides morphological rule induction, our model is capable of dealing with any other string transformation based problems as well. Such problems can be found in the area of bioinformatics (e.g. investigating DNA sequences) and data mining (e.g. preprocessing of data, including spelling correction and data cleaning).
The structure of this paper is the following:

– Section 2 introduces the reference methods: dictionary based systems, finite state transducers, the tree of aligned suffix rules and the lattice based method.

– Section 3 describes the novel ASTRA method: its rule model, training phase and inflection phase. We also introduce three search algorithms to speed up inflection.

– The evaluation of the proposed method can be seen in section 4. The four metrics we measure and compare with the base methods are the training time, average inflection time, size and correctness ratio.

– In section 5 we present a general application of the ASTRA model.

2 Background

2.1 Dictionary Based Models

One of the most basic methods for learning inflection rules is using dictionaries. For morphological usage, a dictionary can be considered as a relation $D \subseteq W \times W$: for each input word it can return an output word. Usually dictionaries contain not only the inflected forms of words, but also other semantic information such as their meaning, part-of-speech tag, sample sentences and so on.

There are many language dependent WordNet projects [14, 15] whose goal is to build such databases. Besides automatic data mining techniques, these databases are often validated and corrected by human experts. Because of the large magnitude of data (the Hungarian WordNet contains more than 40,000 synsets, i.e. word sets with the same meaning), dictionaries can take much time to build. Their advantage is that irregular morphological forms are guaranteed to be retained; they are not dropped by generalization techniques. Besides the training time, the downside of dictionaries is the lack of generalization: other automated methods can usually handle previously unseen words, too, but dictionaries can only inflect and lemmatize words they know.

2.2 Finite State Transducers

The finite state automaton (FSA) is the base model for finite state transducers. An FSA is an $A = \langle Q, \Sigma, q_0, E, F \rangle$, where $Q$ is the finite set of states, $\Sigma$ is the input alphabet, $q_0$ is the start state, $E: Q \times \Sigma \to Q$ is the state transition relation and $F$ is the set of accepting states.

Finite state transducers (FST) [7] extend this model with additional components, as well as with the ability to output strings. There are multiple transducer models. A rational transducer is a $T = \langle Q, \Sigma, \Gamma, q_0, E \rangle$, where $Q$, $\Sigma$ and $q_0$ are the same as for an FSA, $\Gamma$ is the output alphabet and $E \subseteq (Q \times \Sigma^* \times \Gamma^* \times Q)$ is the state transition relation. In practice, $\Sigma = \Gamma$. A sequential transducer is almost the same, except for two additional conditions: $E \subseteq (Q \times \Sigma \times \Gamma^* \times Q)$ and $\forall (q, a, u, q'), (q, a, v, q'') \in E \Rightarrow u = v$ and $q' = q''$. A subsequential transducer is a special sequential transducer that has a sixth component, the state output function $\sigma: Q \to \Gamma^*$. Such a transducer works in the following way: each input character causes a state transition, and the label of this transition is appended to the output string. Finally, the ending state's output is also appended, resulting in the final output string. A transducer is onward if for every state, the state's output and the outputs of the state transitions starting from this state have no common prefix.

FSTs are used extensively for string transformations, because they have optimal sizes and can produce the output in almost constant time. However, as we will see, their generalization ability is not really usable in morphological applications.
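To illustrate the subsequential transducer definition above, the following minimal Python sketch runs such a machine on an input word; the toy transition table, which maps try to tried, is a made-up example rather than a transducer produced by an actual morphology tool.

    # A minimal sketch of a subsequential transducer.
    class SubsequentialTransducer:
        def __init__(self, start, transitions, state_output):
            self.start = start
            # transitions: (state, input char) -> (next state, output string)
            self.transitions = transitions
            # state_output: the sigma function, state -> final output string
            self.state_output = state_output

        def transduce(self, word):
            state, output = self.start, []
            for char in word:
                state, emitted = self.transitions[(state, char)]
                output.append(emitted)
            # append the ending state's output to finish the result
            return "".join(output) + self.state_output.get(state, "")

    past_tense = SubsequentialTransducer(
        start=0,
        transitions={(0, "t"): (1, "t"), (1, "r"): (2, "r"), (2, "y"): (3, "")},
        state_output={3: "ied"},  # replaces the consumed "y" at the word end
    )
    print(past_tense.transduce("try"))  # -> "tried"

A real morphological FST would of course be constructed automatically, e.g. by OSTIA-style state merging, rather than written by hand as above.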
2.3 Tree of Aligned Suffix Rules

There are three main types of substrings that can change in a word during inflection: prefixes, suffixes and infixes. The substring $pre \in \Sigma^*$, $|pre| > 0$ is a prefix of the string $s_1 \in \Sigma^*$ if there exists another string $s_2 \in \Sigma^*$ such that $s_1 = pre + s_2$. Similarly, the substring $suff \in \Sigma^*$, $|suff| > 0$ is a suffix of the string $s_1$ if there exists another string $s_2$ such that $s_1 = s_2 + suff$. The substring $inf \in \Sigma^*$, $|inf| > 0$ is an infix of the string $s_1$ if there exist two other strings $s_2, s_3$ such that $s_1 = s_2 + inf + s_3$, where $|s_2| > 0$ and $|s_3| > 0$.

The TASR model can only work with morphological rules that modify the end of the words, meaning that it can only model suffix transformations. This restriction is acceptable for morphologically simpler languages, but complex agglutinative languages often contain prefix and infix transformation rules as well.

The goal of the TASR learning phase is to generate a set of suffix rules from a training word pair set. This set of rules is denoted by $\mathcal{R}_T = \{R_T\}$ in this paper. A suffix rule consists of two components: $R_T = (\alpha_T, \beta_T)$, where $\alpha_T, \beta_T \in \Sigma^*$. Here, $\alpha_T$ contains the word-ending characters that are modified by the rule, and $\beta_T$ contains the replacement characters. As an example, for the English verb try whose past tense is tried, we can generate a suffix rule where $\alpha_T = y$ and $\beta_T = ied$.

The rule $R_{T_1} = (\alpha_{T_1}, \beta_{T_1})$ is aligned with the rule $R_{T_2} = (\alpha_{T_2}, \beta_{T_2})$, or shortly $R_{T_1} \parallel R_{T_2}$, if $\forall s_1 \in \Sigma^*\ \exists s_2 \in \Sigma^*$ such that $s_1 + \alpha_{T_1} = s_2 + \alpha_{T_2}$ and $s_1 + \beta_{T_1} = s_2 + \beta_{T_2}$. The aligned-with operator is symmetric, so $R_{T_1} \parallel R_{T_2} \iff R_{T_2} \parallel R_{T_1}$.

If we have a word pair, for example (try, tried), we can generate multiple aligned suffix rules. The minimal suffix rule is (y, ied), and after extending this rule with one character at a time, we get (ry, ried) and (try, tried). We can define a frequency metric $freq(R_T \mid I)$ for each rule $R_T$ based on the training word pair set $I = \{(w_1, w_2) \mid w_1, w_2 \in W\}$, counting the number of word pairs for which $R_T$ applies.
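As an illustration of these definitions, the following sketch generates the minimal suffix rule of a word pair together with its aligned extensions, and counts rule frequencies over a toy training set; the training list is a made-up example.

    # A sketch of aligned suffix rule generation for the TASR model:
    # the minimal rule (y, ied) of (try, tried) is extended one
    # character at a time toward the start of the word.
    from collections import Counter
    from os.path import commonprefix

    def aligned_suffix_rules(base, inflected):
        """Return the minimal suffix rule and all of its aligned extensions."""
        common = len(commonprefix([base, inflected]))
        rules = []
        # k = common gives the minimal rule; smaller k extends it leftwards
        for k in range(common, -1, -1):
            rules.append((base[k:], inflected[k:]))
        return rules

    training = [("try", "tried"), ("cry", "cried"), ("play", "played")]
    freq = Counter(rule for w1, w2 in training
                   for rule in aligned_suffix_rules(w1, w2))
    print(aligned_suffix_rules("try", "tried"))
    # -> [('y', 'ied'), ('ry', 'ried'), ('try', 'tried')]
    print(freq[("y", "ied")])  # shared by (try, tried) and (cry, cried) -> 2

Note that for pure suffix additions such as (play, played), the minimal rule degenerates to an empty context in this sketch; it is an illustration of the rule shape, not the authors' implementation.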
For every word pair in the training set, we must first generate all the aligned suffix rules according to the above definitions and insert these rules into a tree $(T, \preceq)$. This tree consists of nodes $n_{T_1}, n_{T_2}, \ldots, n_{T_m}$, each node $n_{T_i}$ associated with a set of rules $n_{T_i} \mapsto \{R_{T_{ij}} = (\alpha_{T_{ij}}, \beta_{T_{ij}})\}$. All the rules associated with the same node have the same context.

Let us have two nodes, $n_{T\downarrow}$ and $n_{T\uparrow}$, associated with the rules $R_{T\downarrow i} = (\alpha_{T\downarrow i}, \beta_{T\downarrow i})$ and $R_{T\uparrow j} = (\alpha_{T\uparrow j}, \beta_{T\uparrow j})$, respectively. The node $n_{T\downarrow}$ is the child of $n_{T\uparrow}$, or shortly $n_{T\downarrow} \prec n_{T\uparrow}$, if $\exists x \in \Sigma: \forall i, j: \alpha_{T\downarrow i} = x + \alpha_{T\uparrow j}$. The root node and its rules are denoted by $n_{T*} \mapsto \{R_{T*k} = (\alpha_{T*k}, \beta_{T*k})\}$. For the root, the following condition applies: $\forall k: |\alpha_{T*k}| = \min_{ij} |\alpha_{T_{ij}}|$.

The child rule $R_{T\downarrow} = (\alpha_{T\downarrow}, \beta_{T\downarrow})$ is subsumed by the parent rule $R_{T\uparrow} = (\alpha_{T\uparrow}, \beta_{T\uparrow})$, written $R_{T\downarrow} < R_{T\uparrow}$, if $\alpha_{T\downarrow} = x + \alpha_{T\uparrow}$ and $\beta_{T\downarrow} = x + \beta_{T\uparrow}$, where $x \in \Sigma$.

After these definitions, we can define which rule is the winning rule of a node $n_{T\downarrow}$ among its associated rules $R_{T\downarrow i} = (\alpha_{T\downarrow i}, \beta_{T\downarrow i})$. Let $n_{T\uparrow}$ be the parent node with rules $R_{T\uparrow j} = (\alpha_{T\uparrow j}, \beta_{T\uparrow j})$. The winner rule is $\hat{R}_{T\downarrow} = R_{T\downarrow k}$ such that $freq(R_{T\downarrow k} \mid I) = \max_i freq(R_{T\downarrow i} \mid I)$ and $\nexists j: R_{T\uparrow j} > R_{T\downarrow k}$.

After that, we can build the tree from the generated rules. Typically the most general rules will be close to the root node, while the most specific rules will be stored in the leaves. Therefore, during inflection we can search the tree in a bottom-up fashion, returning the winner rule of the first node we find whose context matches the input word. Since we start at the leaves, the first matching rule will be the most specific one, having the longest context. This means that the resulting inflected form will mirror the main characteristics of the training data.

2.4 Lattice Based Method

The rule model of the examined lattice based inflection method [13] is a six-tuple $R = (\pi, \gamma, \omega, \overrightarrow{\iota}, \overleftarrow{\iota}, \langle \tau_i \rangle)$, where

– $\pi \in \Sigma^*$ is the prefix of the rule, containing the characters before the changing part,

– $\gamma \in \Sigma^*$ is the core of the rule, i.e. the changing part,

– $\omega \in \Sigma^*$ is the postfix of the rule, containing the characters after the changing part,

– $\overrightarrow{\iota} \in \mathbb{N}$ is the front index of the rule's context occurrence in the source word,

– $\overleftarrow{\iota} \in \mathbb{N}$ is the back index of the rule's context occurrence in the source word and

– $\langle \tau_i \rangle$ is a list of simple transformation steps on the core.

These rules are generated automatically from training word pairs, then inserted into a lattice structure, where the parent-child relationship is based on rule context containment. In the original paper we formalized multiple lattice builder algorithms that tried to reduce the size of the resulting lattice. The best builder only inserts those rules and intersections into the lattice that are really responsible for the high correctness ratio; every other redundant rule is eliminated.

As we will see, the size characteristics of this model are very promising, but because of the high degree of generalization, the lattice can inflect some words incorrectly. This is due to the overgeneralization effect of the lattice model itself.

3 Atomic string transformation rule assembler

The goal of the Atomic String Transformation Rule Assembler (ASTRA) model is to collect atomic, elementary patterns from a training word pair set during the training phase, and to use the best matching atomic rules for each input word during the production phase. For each input, every matching, non-overlapping atomic rule is applied to produce the correct inflected form. As discussed previously, using these concepts the proposed method can model prefix, infix and suffix inflection rules alike, and thus can be used for morphologically complex agglutinative languages.

First of all, we define an extended alphabet so that it is easier to determine where a word starts and ends. Let us introduce two special characters: $ will mark the start of the word and # will mark the end of the word. If a rule's context contains any of these two special characters, it is easy to determine whether the beginning or the end of the word needs to be transformed. Of course, these characters are not part of the original alphabet. The extended alphabet is denoted by $\overline{\Sigma} = \Sigma \cup \{\$, \#\}$. We also define a new operator on strings that prepends $ and appends # to the string: $\psi(w) = \overline{w} = \$ + w + \#$. The inverse operation drops the special characters from the input word: $\psi^{-1}(\overline{w}) = w$. The set of extended words is denoted by $\overline{W}$.

The input of the training process for the new method is the same set of word pairs containing the base form and inflected form of the words, but the first step of the algorithm is to extend these word pairs with the new special characters. After the extension, we get a new training set $\overline{I} = \{(\overline{w}_1, \overline{w}_2)\}$. We split each word pair into matching segments:

$$\overline{w}_1 = \sigma_1^1 \sigma_2^1 \ldots \sigma_k^1, \qquad \overline{w}_2 = \sigma_1^2 \sigma_2^2 \ldots \sigma_k^2.$$

A segment $\sigma_i^1 \to \sigma_i^2$ is called variant if $\sigma_i^1 \neq \sigma_i^2$, otherwise it is called invariant.
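To make the extension operator and the segmentation step concrete, the following sketch computes a simplified decomposition built around the longest common substring. The paper allows several alternating segments and selects among candidate decompositions with the fitness formula defined below, so treat this single-invariant version as an illustration only; it happens to reproduce the decomposition of example 3.1.

    # A sketch of the psi operator and a simplified segment decomposition.
    from difflib import SequenceMatcher

    def psi(word):
        """Extend a word with the start ($) and end (#) markers."""
        return "$" + word + "#"

    def decompose(w1, w2):
        """Split an extended word pair into (variant, invariant, variant)
        segments around the longest common substring."""
        m = SequenceMatcher(None, w1, w2, autojunk=False).find_longest_match(
            0, len(w1), 0, len(w2))
        return [(w1[:m.a], w2[:m.b]),                          # variant (may be empty)
                (w1[m.a:m.a + m.size], w2[m.b:m.b + m.size]),  # invariant
                (w1[m.a + m.size:], w2[m.b + m.size:])]        # variant (may be empty)

    print(decompose(psi("dob"), psi("ledobott")))
    # -> [('$', '$le'), ('dob', 'dob'), ('#', 'ott#')]

Word pairs with several separated change regions would require applying the same split recursively to the variant parts, which is omitted here for brevity.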
In a segment decomposition, variant and invariant segments alternate. As one word pair might have multiple segment decompositions, we need to select the best one among them. To quantify the goodness of a decomposition, we use a segment fitness formula that returns how well aligned the segment $\sigma_i^1 \to \sigma_i^2$ is:

$$\frac{1}{1 + \mathit{index}_{max} - \mathit{index}_{min}} + 2 \cdot |\sigma_i^2|,$$

where $\mathit{index}_{max}$ and $\mathit{index}_{min}$ are the maximal and minimal indices of the $i$th segment, i.e. the maximum and minimum of the indices $\sum_{j=1}^{i-1} |\sigma_j^1|$ and $\sum_{j=1}^{i-1} |\sigma_j^2|$, respectively. This formula encodes that invariant segments are better if their components are longer and the two components appear near each other.

Example 3.1. Let us choose the training word pair (dob, ledobott)¹ to demonstrate the segment decomposition algorithm. First, the words are extended with the special characters: ($dob#, $ledobott#). One valid segment decomposition is the following: $(\sigma_1^1 = \$,\ \sigma_1^2 = \$le)$, $(\sigma_2^1 = dob,\ \sigma_2^2 = dob)$, $(\sigma_3^1 = \#,\ \sigma_3^2 = ott\#)$. The middle segment is invariant, while the first and last ones are variant segments.

¹ Hungarian for (throw, threw down). Note that two affixes are added: one for the past tense and one preverb for down.

For each variant segment, we can define so-called atomic rules of the form $R_A = (\alpha_A, \beta_A, \gamma_A, \omega_A)$, where $\alpha_A$ is the prefix and $\omega_A$ is the suffix. The rule context that must be searched in the input words later is $\kappa(R_A) = \alpha_A + \beta_A + \omega_A$. We can see that with this rule model not only suffix rules can be modelled, thanks to the new $\alpha_A$ and $\omega_A$ components.

Let us take a variant segment $\sigma_i^1 \to \sigma_i^2$. First, we need to define the core atomic rule $R_{A_{ic}} = (\alpha_{A_{ic}}, \beta_{A_{ic}}, \gamma_{A_{ic}}, \omega_{A_{ic}})$ for this segment, which has no prefix or postfix: $|\alpha_{A_{ic}}| = 0$, $\beta_{A_{ic}} = \sigma_i^1$, $\gamma_{A_{ic}} = \sigma_i^2$ and $|\omega_{A_{ic}}| = 0$.

Then we can extend this core atomic rule with one character at a time on the left and right sides, symmetrically. Let us assume that $\sum_{j=1}^{i-1} |\sigma_j^1| = n$, $\sum_{j=i+1}^{k} |\sigma_j^1| = m$ and $|\sigma_i^1| = l$. In this case, the extended rule candidates are $R_{A_{ij}} = (\alpha_{A_{ij}}, \beta_{A_{ij}}, \gamma_{A_{ij}}, \omega_{A_{ij}})$ with the following components ($\forall\, 1 \le j \le \min\{n, m\}$):

$$\alpha_{A_{ij}} = \overline{w}_1[n+1-j;\ n], \quad \beta_{A_{ij}} = \sigma_i^1, \quad \gamma_{A_{ij}} = \sigma_i^2, \quad \omega_{A_{ij}} = \overline{w}_1[n+l+1;\ n+l+j].$$

Here, $w[i; j]$ denotes the substring of $w$ from the $i$th to the $j$th character. To make the generated atomic rules unambiguous, we have to make sure that the context of a rule appears only once in the base form of the word ($\overline{w}_1$). Every atomic rule candidate whose context appears more than once in the base form of the word is dropped from the final set.

Example 3.2. Using the winning segmentation of example 3.1, the following atomic rules can be generated from the word pair (dob, ledobott): (ε, $, $le, ε), (ε, $, $le, d), (ε, $, $le, do), (ε, $, $le, dob), (ε, $, $le, dob#), (ε, #, ott#, ε), (b, #, ott#, ε), (ob, #, ott#, ε), (dob, #, ott#, ε) and ($dob, #, ott#, ε), where ε denotes the empty string.

Transforming a word $\overline{w} \in \overline{W}$ using the atomic rule $R_A = (\alpha_A, \beta_A, \gamma_A, \omega_A)$ can be defined as

$$\tau(R_A, \overline{w}) = \begin{cases} \overline{w} & \text{if } \kappa(R_A) \not\sqsubseteq \overline{w}, \\ \overline{w} \,\backslash\, \kappa(R_A)[\beta_A \to \gamma_A] & \text{otherwise,} \end{cases}$$

where $\overline{w} \,\backslash\, \kappa(R_A)[\beta_A \to \gamma_A]$ means that we search for $\kappa(R_A)$ in $\overline{w}$ and replace $\beta_A$ with $\gamma_A$.
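The sketch below generates the atomic rules of a decomposition; the $(\alpha, \beta, \gamma, \omega)$ tuple layout follows the reconstruction used in this section, and the extension loop clips the left and right contexts at the word boundaries instead of stopping at the $\min\{n, m\}$ bound, since that is what reproduces all ten rules of example 3.2.

    # A sketch of atomic rule generation: one core rule per variant
    # segment, extended character by character on both sides.
    def atomic_rules(w1, segments):
        rules, n = [], 0  # n: characters of w1 consumed so far
        for source, target in segments:
            l = len(source)
            if source != target:  # variant segment
                for j in range(max(n, len(w1) - n - l) + 1):  # j = 0: core rule
                    rule = (w1[max(0, n - j):n],   # alpha: left context
                            source, target,        # beta -> gamma replacement
                            w1[n + l:n + l + j])   # omega: right context
                    context = rule[0] + rule[1] + rule[3]
                    # keep the rule only if its context is unambiguous in w1
                    if w1.count(context) == 1 and rule not in rules:
                        rules.append(rule)
            n += l
        return rules

    decomposition = [("$", "$le"), ("dob", "dob"), ("#", "ott#")]
    for rule in atomic_rules("$dob#", decomposition):
        print(rule)
    # ('', '$', '$le', ''), ('', '$', '$le', 'd'), ... ('$dob', '#', 'ott#', '')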
The base form of the method does not require building a tree; we can simply group the atomic rules based on their contexts. A rule group is defined as a set of atomic rules $G_A = \{R_{A_i} = (\alpha_{A_i}, \beta_{A_i}, \gamma_{A_i}, \omega_{A_i})\}$ such that $\forall R_{A_i}, R_{A_j} \in G_A: \kappa(R_{A_i}) = \kappa(R_{A_j})$. The context of the rule group is $\kappa(G_A) = \kappa(R_A)$ for every $R_A \in G_A$.

Example 3.3. For the atomic rules of example 3.2, we can produce nine different rule groups, each containing a single atomic rule, except for the rule group with context $dob# that contains both (ε, $, $le, dob#) and ($dob, #, ott#, ε).

The goal of the training phase is to produce a set of rule groups $\mathcal{R}_A = \{G_A\}$ based on the training word pair set $\overline{I}$. The generated atomic rule set can then be used to inflect the given input words. For each input, our goal is to choose some atomic rules that match the input word. Rules with longer matching substrings in the input word are better than rules with shorter matching substrings. The fitness function is

$$f(R_A \mid \overline{w}) = \left(\frac{|\kappa(R_A)|}{|\overline{w}|}\right)^k \cdot \delta(\kappa(R_A), \overline{w}),$$

where $k$ is a parameter and the $\delta$ function returns how similar the rule context is to the input word. To simplify things, we used $k = 1$ and a discrete $\delta$ function that returns 1 if $\kappa(R_A) \sqsubseteq \overline{w}$, and 0 otherwise.

Using this fitness function, we can choose the first $n$ atomic rules that are best suited for the given input word, where $n$ is a parameter. We implemented three separate candidate selector algorithms. The first one is a sequential algorithm that processes the rule groups one by one: if a rule group's context matches the input word, its atomic rules are added to the resulting set of candidate rules. The second one is a parallel algorithm that does the same thing in a divide and conquer manner, processing the rule groups in parallel; the number of threads depends on the number of CPU cores. The third one uses a prefix tree that is built from the rule groups during the training phase. With the prefix tree, we can speed up the candidate search by looking up substrings of the input word: if a substring is found in the prefix tree, the appropriate rule group's atomic rules are added to the resulting set.

Since there might be multiple overlapping rule candidates that would transform the same substring of the word, leading to ambiguity, only the first of these rules is used and the others are dropped. After choosing the best non-overlapping rules, we can apply them one by one on the input word, producing its inflected form.
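The following compact sketch combines the sequential search strategy with the rule application step, reusing the $(\alpha, \beta, \gamma, \omega)$ rule layout of the previous sketches; the rule groups below are a hand-picked subset of the rules of example 3.2, and the parallel and prefix tree variants described above are omitted.

    def inflect(word, rule_groups, n_best=5):
        # sequential candidate search: collect the rules of every group
        # whose context occurs in the word (the discrete delta function),
        # ranked by the length-proportional fitness
        candidates = []
        for context, rules in rule_groups.items():
            if context in word:
                fitness = len(context) / len(word)
                for alpha, beta, gamma, omega in rules:
                    start = word.find(context) + len(alpha)  # where beta starts
                    candidates.append((fitness, start, len(beta), gamma))
        candidates.sort(key=lambda c: -c[0])
        # keep the best non-overlapping replacements among the n_best rules
        chosen, used = [], set()
        for fitness, start, size, gamma in candidates[:n_best]:
            span = set(range(start, start + size))
            if not span & used:
                chosen.append((start, size, gamma))
                used |= span
        # apply the replacements from right to left so positions stay valid
        for start, size, gamma in sorted(chosen, reverse=True):
            word = word[:start] + gamma + word[start + size:]
        return word

    groups = {
        "$dob#": [("", "$", "$le", "dob#"), ("$dob", "#", "ott#", "")],
        "$d": [("", "$", "$le", "d")],
    }
    print(inflect("$dob#", groups)[1:-1])  # drop $ and # -> "ledobott"

Both rules of the $dob# group win here with fitness 1 and touch disjoint spans of the word, so both are applied; the shorter $d rule overlaps the first one and is dropped.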
4 Evaluation of the proposed method

For evaluation purposes, we used a training word pair set generated by [16]. We chose the Hungarian accusative case as our target affix type and used up to 10,000 training word pairs. We compared a custom dictionary implementation, Lucene's FST method, the TASR model, the previously mentioned lattice based method and the proposed ASTRA method, measuring their training times, their average inflection times, the sizes of their rule bases and their correctness ratios, i.e. what percentage of the evaluation words is inflected correctly after the training phase. If $W_+$ is the set of evaluation words for which the model yields a correct inflected form, and $W_-$ is the set of failed evaluation words, then the correctness ratio is $|W_+| / (|W_+| + |W_-|)$. Where applicable, we also measured the differences between the sequential, parallel and prefix tree search algorithms in case of ASTRA.

Figure 1: Training time (a) and average inflection time (b) of the methods, plotted against the number of training word pairs on a logarithmic time scale (series: Dict, FST, TASR, Lattice, ASTRA and ASTRA+Pref in (a); Dict, FST, TASR, Lattice, ASTRA+Seq, ASTRA+Par and ASTRA+Pref in (b)).

In Figure 1a we can see the training time of the methods, using a logarithmic scale for the y axis. As we can see, there are three different clusters based on the training time. The fastest solution is to store the already available set of word pairs in a dictionary, because we only have to store these records; no extra processing occurs. Building an FST is the next in line, but it has very similar characteristics to the ASTRA method. If we include the prefix tree building as well, ASTRA's training time increases a bit. The third cluster consists of the TASR and the lattice based methods. It can be seen that building a tree of aligned suffix rules takes more time than the previous methods, and the complexity of the lattice adds even more time to the TASR's results.

Figure 1b shows the average inflection time of the methods. As we can expect, if we use an appropriate hash function in the dictionary implementation, retrieving the matching record for each input word becomes almost constant in time. The second best method in terms of average inflection time is the FST: it also has a very flat curve, but it is a bit higher than the dictionary's. ASTRA with a prefix tree comes next, but it is very close to the line of the lattice based method. The remaining methods have much steeper curves: TASR comes next, with the parallel search function of ASTRA very close to it, while the worst inflection time is produced by the sequential search function. Note that although the inflection time of the prefix tree search variant is the best for ASTRA, it adds some overhead to the training time. However, even with this overhead, we can say that it is worth using.

Figure 2: Size of the rule bases, plotted against the number of training word pairs (series: Dict, FST, TASR, Lattice, ASTRA).

In Figure 2 we can see the overall size of the rule bases, i.e. the number of word pairs in the dictionary, states in the FST, nodes in the TASR and the lattice, and atomic rules in case of ASTRA.

It is not surprising that there are more generated atomic rules in ASTRA than nodes in the tree of aligned suffix rules, since the atomic rule definition allows multiple variant segments in a word pair, and from these variant segments multiple core and extended atomic rules can be produced. On the other hand, TASR will only generate one minimal suffix rule per word pair and all of its aligned extensions. The advantage of the ASTRA model is that even with this higher number of rules and the prefix tree, we can train it faster than a TASR. Moreover, it can cover more cases, including prefix, infix and suffix rules.

The built FST has better size characteristics, because its builder algorithm merges every state that can be merged without losing information from the original training word pair set. It can be seen from the line of the dictionary that the number of states in an FST and the number of rules in the ASTRA and TASR are higher than the number of input word pairs. However, the minimal lattice builder algorithm produces an even better lattice size, as the number of nodes in the resulting lattice is lower than the size of all the other structures.

Finally, Figures 3a and 3b show the correctness ratio of the models.
The results on the left side were achieved by using disjoint training and evaluation word pair sets. We can see that the correctness ratio plateaus a bit below 95% for TASR and ASTRA, the latter performing a bit better. It can also be seen that the lattice based method is worse, probably because of its higher degree of generalization. When we examined the results of the lattice compared to TASR and ASTRA, we saw that in multiple cases the lattice found a node whose rule resulted in an invalid inflected form. The correctness ratio of the dictionary and the FST is 0%, because they could not generalize at all. For the dictionary this is understandable, because a dictionary is a static map of word pairs. On the other hand, although an FST can generalize, these types of morphological applications do not benefit from this generalization, as the generalized transformations do not result in real inflection rules.

On the right side of the figure, we can see what happens if we use the first 100, 200, ..., 10,000 word pairs to train the methods, and then use the same 10,000 word pairs for evaluation. All the methods have an almost 100% correctness ratio at the end of the diagram. The only reason we cannot reach 100% is that the training word pair set contains records with the same lemma and different inflected forms, such as örömöt and örömet, two valid inflected forms of the Hungarian word öröm (joy in English). The difference resides in the characteristics of the curves. The dictionary and the FST cannot really generalize inflection rules, so their lines are linear. The other methods reach higher percentages more quickly, and as we can see, the ASTRA method is even better than the TASR in that it can produce a better correctness ratio with a smaller number of training word pairs. The lattice based method is worse than TASR and ASTRA in this case as well.

5 General application of the ASTRA model

One of the scientific areas applying string algorithms, including string transformation based methods, is bioinformatics and computational biology [17]. DNA sequences are modelled using strings of four characters matching the four types of bases: adenine (A), thymine (T), guanine (G) and cytosine (C). One of the goals of bioinformatics is to compare genes in DNA to find the regions that are important, find out which region is responsible for which functions and features, and determine how genetic information is encoded. The process of DNA analysis is a very computation intensive task, which is why modelling, statistical algorithms and mathematical techniques are important aspects of success.

Besides applying string transformations, computational biology uses many string matching and comparison techniques as well [18]. Finding the longest matching substrings of two strings (DNA sequences) helps in finding the best DNA alignments, and thus in comparing different DNA sequences and finding matching parts and differences. One of the techniques used for this comparison is the application of the edit distance computation originally published by Levenshtein [19] for morphological analysis.

Another application area where string transformation based methods are applied is data mining. Data mining engines usually consist of multiple phases to extract information out of unannotated training data such as long free texts.
The first phase is often called data cleaning, where the raw input data is preprocessed so that invalid records are either removed or fixed before moving on with the data mining algorithms. One way to fix the typos and other errors in free texts is spelling correction. Spelling correction can be interpreted as learning those string transformations that can transform an unknown word containing typos into the closest known word. There are multiple techniques to solve this problem; usually iterative algorithms perform better, as there can be multiple problems with a word that are easier to fix in multiple steps [20]. The goal is to find a word $w \in W$ for any unknown string $s$ so that their distance $d(w, s) < \varepsilon$ is lower than an acceptable threshold.

A third, more intuitive non-morphological application of the ASTRA model is character sorting. Let us have a random string $s \in \Sigma^*$ with a given length of $|s| = n$. The goal is to rearrange the characters in $s$ so that for each index $i$, $1 \le i < n$, $s[i] \le s[i+1]$.
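To make the spelling correction formulation above concrete, the following sketch implements the classic dynamic-programming edit distance of Levenshtein [19] and the $d(w, s) < \varepsilon$ nearest known word lookup; the word list and the threshold are made-up examples.

    def levenshtein(a, b):
        """Edit distance: the minimal number of insertions, deletions
        and substitutions transforming a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def correct(s, known_words, epsilon=3):
        """Return the known word closest to s, if within the threshold."""
        best = min(known_words, key=lambda w: levenshtein(w, s))
        return best if levenshtein(best, s) < epsilon else None

    print(correct("morphologie", ["morphology", "phonology", "syntax"]))
    # -> "morphology" (distance 2 < epsilon)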