https://doi.org/10.31449/inf.v47i3.4761    Informatica 47 (2023) 349–360

Khmer-Vietnamese Neural Machine Translation Improvement Using Data Augmentation Strategies

Thai Nguyen Quoc 1, Huong Le Thanh 1,*, and Hanh Pham Van 2
1 School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam
2 FPT AI Center
E-mail: thai.nq212642m@sis.hust.edu.vn, huonglt@soict.hust.edu.vn, hanhpv@fsoft.com.vn

Keywords: machine translation, data augmentation, low-resource, Khmer-Vietnamese

Received: March 22, 2023

The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it on a small bilingual dataset. Additionally, two data augmentation strategies are proposed to generate new training data: (i) back-translation with a dataset in the target language; (ii) data augmentation via the English pivot language. The proposed approach is applied to Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.

Povzetek: The research uses a pre-trained multilingual model and data augmentation. The results surpass Google Translator by 5.3%.

1 Introduction

Machine translation (MT) is the task of automatically translating text from one language to another. There are three common approaches to MT: the rule-based approach [1], the statistical approach [2, 3], and the neural approach [4, 5, 6]. The rule-based approach depends on translation rules and dictionaries created by human experts. Statistical Machine Translation (SMT) relies on techniques such as word alignment and language modeling to optimize the translation process. While SMT can handle a wide range of languages and translation scenarios, it often struggles to capture complex linguistic phenomena and long-range dependencies. With significant advancements in deep learning, Neural Machine Translation (NMT) approaches have shown great potential and have replaced SMT as the primary approach to MT. NMT models capture contextual information, handle word reordering, and generate fluent and natural translations. NMT has gained popularity due to its end-to-end learning, its ability to handle complex linguistic phenomena, and its improved translation quality.

Among NMT systems, transformer-based MT models [7, 8] have demonstrated superior performance. The key feature of transformer models [8] is their attention mechanism, which allows them to effectively capture dependencies between different words in a sentence. Unlike traditional recurrent neural networks that process words sequentially, transformers can consider the entire input sentence simultaneously. This parallelization significantly speeds up training and makes transformers more effective for long-range dependencies.

One notable limitation of NMT techniques is their reliance on a substantial number of parallel sentence pairs for model training. Unfortunately, most language pairs in the world lack such large datasets.
Consequently, these language pairs fall under the category of low-resource MT, presenting a challenging scenario for the application of neural-based models.

Several works have addressed the low-resource problem in NMT. Chen et al. [9] and Kim et al. [10] dealt with low-resource NMT by using pivot translations, where one or more pivot languages are selected as a bridge between the source and target languages. The source-pivot and pivot-target pairs should be rich-resource language pairs. Sennrich et al. [11] and Zhang [12] applied forward/backward translation approaches to generate parallel sentence pairs by translating monolingual sentences into the target/source language via a translation system. The pseudo-parallel data is then mixed with the original parallel data to train an NMT model. A problem with this approach is how to control the quality of the pseudo-parallel dataset in order to improve the performance of the low-resource NMT system.

Since NMT requires the capability of both language understanding (the NMT encoder) and generation (the NMT decoder), pre-trained language models can be very helpful for NMT, especially low-resource NMT. To this end, the BART model [13] adds noise and randomly masks some tokens of the input sentences in the encoder, and learns to reconstruct the original text in the decoder. The T5 model [14] randomly masks some tokens and replaces consecutive masked tokens with a single sentinel token.

To address the low-resource problem in NMT, we propose to fine-tune mBART [15], a pretrained multilingual Bidirectional and Auto-Regressive Transformers model that has been specifically designed for multilingual applications, including MT. The fine-tuning process is combined with several strategies, including back-translation [11] and data augmentation via a pivot language. We propose several data augmentation strategies to augment the training data as well as to control the data quality.

Our proposed approach can be applied to any low-resource language pair. In this research, we evaluate it on the low-resource Khmer-Vietnamese (Km-Vi) language pair, using a dataset with 142,000 parallel sentence pairs from Nguyen et al. [16]. As far as we know, there are only two works dealing with Km-Vi machine translation ([17], [18]). Nguyen et al. [17] presented an open-source MT toolkit for low-resource language pairs. However, this approach only used a transformer architecture trained on the original dataset, without applying fine-tuning, transfer learning, or additional data augmentation techniques. Pham and Le [18] fine-tuned mBART and applied some data augmentation strategies. In this research, we extend the work in [18] to improve the performance of the Km-Vi NMT system. The contributions are as follows:

– We propose new methods for data selection based on sentence-level cosine similarity through a bi-encoder model [19] combined with the TF-IDF score.
– We suggest a data generation strategy to generate the best candidates for the synthetic parallel dataset.
– To control the quality of the augmented data, we propose an "aligned" version to enrich the data and a two-step filtering process to eliminate low-quality parallel sentence pairs.

The remainder of this paper is organized as follows.
Section 2 analyzes various techniques in existing research that address the limitations of low-resource NMT. Section 3 describes our system diagram. Our proposed data augmentation strategies are outlined in Section 4. Section 5 elaborates on the experimental design, whereas Section 6 presents an analysis of the empirical outcomes. Finally, Section 7 concludes the paper.

2 Related work

Pretrained Language Models (PLMs) have proven to be helpful instruments in the context of low-resource NMT. The literature has shown that low-resource NMT models can benefit from the use of a single PLM [20, 21] or a multilingual one [13]. A multilingual PLM is claimed to facilitate more effective learning of the connection between the source and target representations for translation. These transfer learning methods leverage rich-resource language pairs to train the system, then fine-tune all parameters on the specific target language pair [22]. The rich-resource language pairs should belong to a language family similar to the low-resource ones in order to obtain good results.

Data augmentation is the method of generating additional data, achieved by expanding the original dataset or integrating supplementary data from relevant sources. Various approaches to data augmentation have been explored, including: (i) paraphrasing and sentence simplification [23], (ii) word substitution and deletion [24, 25], (iii) limited and constrained word order permutation [26], (iv) domain adaptation [27], (v) back-translation [11], and (vi) data augmentation via a pivot language [28].

Paraphrasing and sentence simplification [23] offer varied quality, with a risk of introducing semantic changes or losing important information. Word substitution [24] requires careful selection of synonyms to maintain accuracy, while word deletion [25] can introduce noise and requires effective training to handle missing information. Limited and constrained word order permutation [26] suits language pairs with word order variations but requires defining complex constraints based on language characteristics. Domain adaptation [27] addresses the challenge of domain-specific low-resource machine translation, which is not the target of this research. Back-translation [11] has proven successful by generating synthetic source sentences through translating target sentences. However, this approach carries a risk of errors due to imperfections in pre-trained translation models. On the other hand, the pivot-based approach [28] translates low-resource language pairs through a high-resource language. This approach relies on good translation quality to and from the pivot language.

Back-translation and pivot-based translation are considered reliable and generalizable approaches when complemented by effective post-processing methods for filtering low-quality data. Therefore, this paper specifically concentrates on back-translation and pivot-based translation as the selected methods for data augmentation. To improve the quality of the synthetic parallel data generated by these methods, two strategies are employed: (i) data selection and (ii) synthetic data filtering.

Data selection is the process of ranking and selecting a subset of a target monolingual dataset that is in the same domain as the training data. The objective of this process is to improve the performance of an NMT system for a particular domain.
Various techniques for data selection have been proposed in the literature, such as computing sentence scores based on Cross-Entropy Difference (CED) [29, 30] and using representation vectors to rank sentences in the monolingual dataset [31, 32]. Three data selection methods were implemented by Silva et al. [33], namely CED, TF-IDF, and Feature Decay Algorithms (FDA) [34]. The experimental results showed that the TF-IDF method gained the best improvements in both BLEU and TER (Translation Error Rate) scores.

Synthetic data filtering. To filter out low-quality sentence pairs, Imankulova et al. [35] proposed a method based on the BLEU measure. This method leverages a source-to-target NMT model to translate the synthetic source sentences into synthetic target sentences. Subsequently, the sentence-level BLEU score is calculated between each synthetic target sentence and the corresponding target sentence, with the objective of excluding low-score sentences. Koehn et al. [36] proposed another approach based on the sentence-level cosine similarity of two sentences. However, their proposal requires an effective acquisition of the linear mapping relationship between the embedding spaces of the source language and the target one.

Another way to improve translation quality is to use data augmentation via a pivot language [28]. This method translates sentences from the source language to the pivot language using a source-pivot translation model, and then translates the pivot-language sentences into the target language. However, there are certain restrictions associated with this technique. Firstly, the circular translation process increases the decoding time during inference, as it can iterate through multiple languages to obtain the desired quality. Secondly, translation errors may arise in each step, which can lead to low-quality translations in the target language.

In this paper, we introduce an approach aimed at enhancing the performance of low-resource MT. Our approach incorporates multiple data augmentation strategies alongside various data filtering methods to improve the quality of synthetic data. The subsequent sections introduce these methods in detail.

3 Our system diagram

As previously mentioned, our goal is to propose strategies that can improve the performance of low-resource NMT systems. The proposed approach is applied to the Km-Vi language pair. To do this, we first fine-tune the mBART50 [37] model with the Km-Vi bilingual dataset.

The mBART model. Multilingual BART (mBART) [15] is a sequence-to-sequence denoising auto-encoder that was pre-trained on large-scale monolingual corpora in many languages using the BART objective [13]. The pre-training task is to reconstruct the original text from a noisy version, using two types of noise: random span masking and order permutation. A variant of mBART called mBART50 [37] has been trained on 50 languages, including Khmer and Vietnamese. Nonetheless, mBART50's translation quality for the Km-Vi language pair is low. To deal with this problem, we propose to fine-tune mBART50 with the Km-Vi bilingual dataset combined with the dataset augmented through several strategies.
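As a concrete illustration of this setup, the sketch below fine-tunes an mBART50 checkpoint for the Khmer-to-Vietnamese direction with the HuggingFace transformers library. The checkpoint name, the toy in-memory dataset, and most hyperparameters are illustrative assumptions rather than the exact configuration used in this work (only the learning rate matches the value reported in Section 5.1).

```python
# Minimal sketch: fine-tuning mBART50 for Khmer -> Vietnamese.
# Assumptions: the public checkpoint "facebook/mbart-large-50-many-to-many-mmt"
# and a toy in-memory dataset stand in for the actual data pipeline.
import torch
from transformers import (MBart50TokenizerFast, MBartForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"   # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    checkpoint, src_lang="km_KH", tgt_lang="vi_VN")
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# Toy parallel pairs standing in for the 140,000-pair Km-Vi training set.
pairs = [("<Khmer sentence>", "<Vietnamese sentence>")]

def encode(pair):
    km, vi = pair
    # Tokenize source and target; for brevity, padded label tokens are not
    # masked to -100 here as one would do in a production setup.
    return dict(tokenizer(km, text_target=vi, max_length=128,
                          truncation=True, padding="max_length"))

train_data = [encode(p) for p in pairs]

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-km-vi",
    per_device_train_batch_size=8,
    learning_rate=3e-5,               # value reported in Section 5.1
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```

The same recipe, with source and target languages swapped, yields the target-to-source (Vi-Km) model used for back-translation in Section 4.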
Our proposed Khmer-Vietnamese MT system is described in Figure 1. It incorporates two strategies for data augmentation: (i) back-translation with a dataset in the target language; and (ii) data augmentation via the English pivot language. These strategies are introduced in the next section.

Figure 1: Our proposed Khmer-Vietnamese MT system diagram.

4 Data augmentation strategies

Since word order and word meaning are important in machine translation, methods such as paraphrasing, simplification, and limited and constrained word order permutation cannot provide good parallel sentence pairs.

4.1 Back-translation with a dataset in the target language

The back-translation method proposed by Sennrich et al. [11] is a useful way to generate additional training data for low-resource NMT. This method leverages an external dataset in the target language, termed the "target-language dataset". It employs a target-to-source NMT model, trained on the original bilingual dataset, to translate this dataset into the source language. The resulting translated sentences are then combined with their corresponding target sentences, creating a synthetic bilingual dataset. However, the quality of the dataset generated by this method is not guaranteed. To address this issue, we improve the method by integrating data filtering techniques into the back-translation process. Our proposed method is conducted in three steps as follows:

– Step 1 - Data selection: Rank and select sentences from a target-language dataset that are in the same domain as the sentences in the original bilingual dataset.
– Step 2 - Data generation: Each sentence from the output dataset of Step 1 is translated into k sentential candidates in the source language using the target-to-source NMT model, which has been trained by fine-tuning mBART50 with the original bilingual dataset.
– Step 3 - Data filtering: Filter out low-quality bilingual sentence pairs in the synthetic parallel dataset.

We discuss these three steps in the following sections.

4.1.1 Data selection

For a given dataset D consisting of T_D sentence pairs in a specific domain and a set of sentences G in a general domain, the aim of data selection is to rank the sentences in G based on their similarity to the domain of D, and then select the highest-ranked sentences to form a subset that shares the same domain as D. Given that TF-IDF is a popular technique for identifying representative words of a dataset, we can assess whether sentences in G belong to the same domain as D using this measure. In addition to the TF-IDF measure, cosine similarity can be employed to measure the semantic similarity between two sentences based on their semantic vector representations. This enables the identification of sentences in G that share the same domain as the sentences in D. For this reason, TF-IDF, cosine similarity, and their combination are used for ranking.

Data selection based on TF-IDF score. The term frequency (TF) measures the frequency of a term (word or subword) in a sentence, while the inverse document frequency (IDF) reflects the proportion of sentences in the in-domain corpus that contain the term. The TF-IDF score of a word w in a sentence s in G is calculated as:

$$\mathrm{score}_w = \mathrm{TF\text{-}IDF}_w = \frac{F_w^G}{W_s^G} \cdot \frac{T_D}{K_w^D}$$

where $F_w^G$ is the frequency of $w$ in $s$, $W_s^G$ is the number of words in $s$, $T_D$ is the number of sentences in $D$, and $K_w^D$ is the number of sentences in $D$ that contain $w$. The TF-IDF score of a sentence $s \in G$ is then the sum of its word scores:

$$\mathrm{score}_s^{(\mathrm{TF\text{-}IDF})} = \sum_{i=1}^{W_s^G} \mathrm{score}_{w_i}$$
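The sketch below illustrates this TF-IDF ranking on a toy corpus. The whitespace tokenisation and the two tiny corpora are assumptions made for illustration only; the actual system scores the crawled Vietnamese corpus described in Section 5.

```python
# Minimal sketch of the TF-IDF sentence score defined above.
# Assumptions: whitespace tokenisation; D and G are toy corpora.
from collections import Counter

D = ["giá vàng hôm nay tăng mạnh",                  # in-domain sentences (toy)
     "thị trường chứng khoán giảm điểm"]
G = ["giá vàng tăng nhẹ trong phiên sáng",          # general-domain sentences (toy)
     "trận đấu kết thúc với tỉ số hòa"]

T_D = len(D)
# K_w^D: number of sentences in D containing word w
K = Counter()
for sent in D:
    for w in set(sent.split()):
        K[w] += 1

def sentence_tfidf_score(s):
    words = s.split()
    W_s = len(words)
    freq = Counter(words)
    score = 0.0
    for w in words:                    # sum over the W_s^G word tokens of s
        if K[w] == 0:                  # a word never seen in D contributes nothing
            continue
        tf = freq[w] / W_s             # F_w^G / W_s^G
        idf = T_D / K[w]               # T_D / K_w^D
        score += tf * idf
    return score

# Rank G by similarity to the domain of D and keep the highest-ranked sentences.
ranked = sorted(G, key=sentence_tfidf_score, reverse=True)
print(ranked[0])
```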
Data selection based on cosine similarity score. The cosine similarity score between two sentences is calculated using a Bi-Encoder model [38]. This model combines a PLM with a pooler layer to encode each sentence as a sentence-level representation vector; the cosine similarity between two such vectors is then computed. To choose the optimal PLM for Vietnamese (the target language), we built a test set for the masked language model task, consisting of 140,000 Vietnamese sentences from the Km-Vi bilingual dataset. Based on the accuracy of several well-known PLMs on this test set (Table 1), namely PhoBERT (https://huggingface.co/vinai/phobert-base), XLM-RoBERTa (https://huggingface.co/xlm-roberta-base), and mDeBERTa (https://huggingface.co/microsoft/mdeberta-v3-base), XLM-RoBERTa is selected as the PLM for the Bi-Encoder model.

Table 1: Accuracy of some models on the test set for the masked language model task.

Model        | Accuracy
PhoBERT      | 80%
XLM-RoBERTa  | 87%
mDeBERTa     | 75%

The cosine similarity score of a sentence $s \in G$ is calculated as:

$$\mathrm{score}_s^{(\mathrm{COS})} = \frac{1}{|D|} \sum_{i=1}^{|D|} \cos(s, D_i)$$

where $|D|$ is the number of sentences in $D$ and $D_i$ is the $i$-th sentence in $D$.

Data selection based on combination score. The combination score is calculated from the TF-IDF score and the cosine similarity score:

$$\mathrm{score}_s = \frac{\mathrm{score}_s^{\mathrm{TF\text{-}IDF}}}{\sum_{j=1}^{|G|} \mathrm{score}_{G_j}^{\mathrm{TF\text{-}IDF}}} + \frac{\mathrm{score}_s^{\mathrm{COS}}}{\sum_{j=1}^{|G|} \mathrm{score}_{G_j}^{\mathrm{COS}}}$$

where $|G|$ is the number of sentences in $G$ and $G_j$ is the $j$-th sentence in $G$.

After assigning these scores to the sentences in the corpus G, the top 120,000 sentences of the target-language dataset with the highest scores are selected and translated into the source language by the target-to-source translation model.

4.1.2 Synthetic data generation

To increase the number of generated sentence pairs, each sentence from the target-language dataset is translated into k candidate sentences in the source language using beam search (with beam size k) or top-k sampling. As a result, k bilingual sentence pairs are created for each sentence in the target-language dataset. At this step, the synthetic dataset size can increase significantly; however, the dataset may contain many low-quality candidates. The next section presents our method for filtering out these low-quality candidates.
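The following sketch illustrates this generation step for a single selected Vietnamese sentence, producing k Khmer candidates with either top-k sampling or beam search. The local checkpoint path is hypothetical, standing in for the fine-tuned target-to-source (Vi-Km) model; k and the decoding settings are illustrative.

```python
# Minimal sketch of Step 2 (data generation): one Vietnamese sentence is
# translated into k Khmer candidates with the fine-tuned Vi->Km model.
# "mbart50-vi-km" is a hypothetical local path to that checkpoint.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = "mbart50-vi-km"                                # hypothetical fine-tuned model
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt, src_lang="vi_VN", tgt_lang="km_KH")
model = MBartForConditionalGeneration.from_pretrained(ckpt)

vi_sentence = "Giá vàng hôm nay tăng mạnh."
inputs = tokenizer(vi_sentence, return_tensors="pt")
k = 4

# Top-k sampling (the decoding strategy that performs best in Section 6) ...
sampled = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],
    do_sample=True, top_k=50, num_return_sequences=k, max_new_tokens=128)
# ... or, alternatively, beam search with k beams.
beamed = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],
    num_beams=k, num_return_sequences=k, max_new_tokens=128)

km_candidates = tokenizer.batch_decode(sampled, skip_special_tokens=True)
# Each (km_candidate, vi_sentence) pair is passed to the filtering step below.
```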
4.1.3 Synthetic data filtering

Our data filtering approach is based on sentence-level cosine similarity. It compares the similarity between the original sentence and its corresponding back-translated sentence, enabling us to identify and eliminate sentence pairs that deviate significantly from the original meaning. Our method distinguishes itself from Koehn's approach [36] by not requiring an effective acquisition of the linear mapping relationship between the embedding spaces of the source and target languages. Instead, we leverage a cosine similarity measure to assess the semantic similarity between sentences.

Data filtering based on cosine similarity. An important aspect of this approach is how sentences in different languages are represented. Although multilingual LMs (e.g., XLM-RoBERTa) can produce such representations, their out-of-the-box sentence representations are rather poor. Moreover, the vector spaces of different languages are not aligned, meaning that words or sentences with the same meaning in different languages are represented by different vectors. Reimers and Gurevych [39] proposed a straightforward technique to ensure consistent vector spaces across different languages. This method uses a PLM as a fixed Teacher model that produces good representation vectors of sentences. The Student model is designed to imitate the Teacher model: the same sentence should be represented by the same vector in both the Teacher and the Student model. To enable the Student model to work with additional languages, it is trained on parallel (translated) sentences, where the translation of each sentence should also be mapped to the same vector as the original one. In Figure 2, the Student model should map "Hello World" and the German translation "Hallo Welt" to the vector of Teacher("Hello World"). This is achieved by training the Student model with the mean squared error (MSE) loss.

Figure 2: Given parallel data (e.g., English and German), the Student model is trained so that the produced vectors for the English and German sentences are close to the Teacher's English sentence vector [39].

Based on this approach, we first generate two bilingual datasets, Vietnamese-English and Khmer-English parallel sentence pairs, from the original Km-Vi dataset using the Google Translator API provided by the deep-translator library (https://github.com/nidhaloff/deep-translator). The Student model is then trained on both the Vietnamese-English dataset and the Khmer-English one to create semantic vectors for three languages: English, Vietnamese, and Khmer. The representation vector of a sentence is the average of its token embeddings produced by the Student model. We calculate the sentence-level cosine similarity of each pair in the synthetic parallel dataset and filter out pairs with low scores.

Data filtering using round-trip BLEU. The diagram of this method is presented in Figure 3. The process begins with the training of two NMT models, Km-Vi (source-to-target) and Vi-Km (target-to-source), on the given parallel sentence pairs. Next, we use the Vi-Km translation model to translate the monolingual Vietnamese sentences into Khmer. We then back-translate the translated sentences using the Km-Vi model. We evaluate the quality of the sentence pairs based on sentence-level BLEU scores and discard sentence pairs with low scores.

Figure 3: The diagram of the data filtering using round-trip BLEU.
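A minimal sketch of the two filtering criteria is given below. It assumes a publicly available multilingual sentence encoder as a stand-in for the distilled Student model described above (this work trains its own), and it uses the threshold values that turn out best in Tables 3 and 4; whether the stand-in encoder represents Khmer well enough is an open assumption.

```python
# Minimal sketch of the two filtering criteria (Step 3), under the assumptions
# stated in the lead-in: a public multilingual encoder replaces the distilled
# Student model, and the thresholds are the best values from Tables 3-4.
import sacrebleu
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # stand-in encoder

def cosine_ok(vi_sentence, km_synthetic, threshold=0.7):
    """Keep the pair if the cross-lingual cosine similarity is high enough."""
    emb = encoder.encode([vi_sentence, km_synthetic], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def round_trip_bleu_ok(vi_original, vi_round_trip, threshold=15.0):
    """Keep the pair if the round-trip translation still matches the original."""
    return sacrebleu.sentence_bleu(vi_round_trip, [vi_original]).score >= threshold

vi = "Giá vàng hôm nay tăng mạnh."
km_synth = "<Khmer back-translation of the sentence>"
vi_round_trip = "<Vietnamese obtained by translating km_synth back with the Km-Vi model>"

keep_cos = cosine_ok(vi, km_synth)                   # proposed filter (Scenario #3.2)
keep_bleu = round_trip_bleu_ok(vi, vi_round_trip)    # round-trip BLEU filter (Scenario #3.1)
```

In the pipeline, pairs failing the chosen criterion are simply dropped before the synthetic data is merged with the original bilingual corpus.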
4.2 Data augmentation method via the English pivot language

A standard data augmentation method via the English pivot language translates the sentences in the target language from the original source-target parallel pairs into English. These English sentences are then translated into the source language to generate source-target augmentation bilingual sentence pairs.

We propose an "aligned" version to improve the quality of the augmented dataset. Given an original source-target sentence pair with a source sentence w_s and a target sentence w_t, we generate additional candidate sentences in the following way. The target sentence w_t is translated into the source language via the English pivot language, producing a candidate sentence w_c1 in the source language. The target-to-source translation model described in Section 4.1 is used to generate another candidate sentence w_c2 in the source language. The candidate pairs w_c1 and w_c2 are aggregated into a temporary dataset. We then carry out two filtering steps to remove low-quality parallel sentence pairs: (i) aligning parallel sentence pairs and (ii) data filtering.

In the first step, the temporary dataset is aligned by three tools: Vecalign (https://github.com/thompsonb/vecalign), Bleualign (https://github.com/rsennrich/Bleualign), and Hunalign (https://github.com/danielvarga/hunalign). Vecalign utilizes word embeddings to align sentences based on semantic similarity. Bleualign, on the other hand, uses the BLEU metric and n-gram overlap to align sentences in bilingual corpora. Hunalign is a heuristic-based tool that aligns parallel texts based on sentence length and lexical similarity. Sentence pairs that are aligned by at least two of the three tools are selected to form an aligned dataset. In the second step, the aligned dataset is filtered using the data filtering method of Section 4.1.3. As a result, we obtain an augmented dataset, which is combined with the synthetic parallel dataset from Section 4.1 and the original bilingual dataset to form the final training dataset.

5 Experiments

5.1 Experiment setup

We fine-tuned the mBART50 model on an RTX 3090 (24 GB) GPU with different hyperparameters to choose the optimal parameter set for the model, as follows: Adam optimization (learning_rate = 3e-5, β1 = 0.9, β2 = 0.999, and ε = 1e-8) with linear learning rate decay scheduling. The best set of hyperparameters is employed in all our experiments.

To evaluate the effectiveness of our experiments, we used the BLEU score [40] computed with sacreBLEU (https://github.com/mjpost/sacrebleu). A higher BLEU score indicates better translation quality.

5.2 Experimental scenarios

To evaluate the effectiveness of our proposed methods for low-resource NMT, we used the Km-Vi bilingual dataset from Nguyen et al. [16]. This dataset consists of 142,000 parallel sentence pairs, divided into a training set of 140,000 sentence pairs and a test set of 2,000 pairs. To prevent bias in the experiments, Nguyen et al. [16] randomly selected the 2,000 test sentence pairs from the original bilingual dataset, following the distribution of domains and sentence lengths. Six scenario groups were carried out in our experiments.

Scenario group #1 - Baseline model: Fine-tune the mBART50 model on the original Km-Vi bilingual dataset (Scenario #1).

All scenario groups from #2 to #6 used additional bilingual datasets generated from the Vietnamese corpus or from the original Km-Vi bilingual dataset. These datasets were combined with the original dataset to create a larger training corpus. The Vietnamese dataset was created by crawling online news websites (i.e., https://vnexpress.net and https://dantri.com.vn), then preprocessing to remove noise and overly long sentences. The langdetect library (https://pypi.org/project/langdetect) was used to filter out non-Vietnamese sentences.

Scenario group #2 (#2.1 to #2.6) - Combine Scenario #1 and back-translation: To generate a synthetic parallel dataset, 120,000 sentences from the above-mentioned Vietnamese dataset were selected using our data selection strategies. These sentences were then translated into Khmer using our back-translation method.
We implemented and compared four data selection methods and two decoding methods (i.e., sampling and beam search).

Scenario group #3 (#3.1 to #3.3) - Combine Scenario #2 and data filtering: In this scenario, we compared two methods in the data filtering strategy: round-trip BLEU [35] (#3.1) and our proposed sentence-level cosine similarity (#3.2). We experimented with two types of data selection: TF-IDF (#3.1 and #3.2) and the combination score (#3.3).

Scenario group #4 (#4.1 to #4.2) - Combine Scenario #1 and data augmentation via the English pivot language: We compared the "standard" and "aligned" versions for generating an augmented dataset. The Google Translator API is used for the translation task.

Scenario group #5 (#5.1 to #5.2) - Combine Scenarios #3 and #4: We created a new training dataset from the best settings of Scenarios #3 and #4.

Scenario group #6 (#6.1 to #6.2) - Combine Scenario #5 and data generation: In this experiment, at the back-translation step, each sentence from the Vietnamese dataset was translated into k corresponding Khmer candidate sentences. These sentences were then filtered and combined with the original bilingual dataset to create a new training dataset.

6 Experimental results

This section presents a comprehensive evaluation of our system's performance under various scenarios and compares the best results with other relevant research. The analysis of the augmented data's quality is provided in Appendix 1.

6.1 Analysis of our system's performance using different scenarios

We evaluated our different scenarios on a test set with 2,000 parallel sentence pairs. The results are presented in Table 2.

Table 2: Experimental results.

Scenario | Name                    | Data Selection    | Decoding Strategy | Data Filtering    | Via English pivot language | BLEU (%)
#1       | Baseline model          | -                 | -                 | -                 | -        | 52.32
#2.1     | #1 + Back-translation   | Randomness        | Beam search       | -                 | -        | 53.16
#2.2     | #1 + Back-translation   | Randomness        | Sampling          | -                 | -        | 53.49
#2.3     | #1 + Back-translation   | TF-IDF            | Beam search       | -                 | -        | 53.83
#2.4     | #1 + Back-translation   | TF-IDF            | Sampling          | -                 | -        | 53.96
#2.5     | #1 + Back-translation   | Cosine similarity | Sampling          | -                 | -        | 53.98
#2.6     | #1 + Back-translation   | Combination score | Sampling          | -                 | -        | 54.08
#3.1     | #2 + Data filtering     | TF-IDF            | Sampling          | Round-trip BLEU   | -        | 54.27
#3.2     | #2 + Data filtering     | TF-IDF            | Sampling          | Cosine similarity | -        | 54.38
#3.3     | #2 + Data filtering     | Combination score | Sampling          | Cosine similarity | -        | 54.48
#4.1     | #1 + Data augmentation  | -                 | -                 | -                 | Standard | 52.98
#4.2     | #1 + Data augmentation  | -                 | -                 | -                 | Aligned  | 53.29
#5.1     | #3 + #4                 | TF-IDF            | Sampling          | Cosine similarity | Standard | 54.51
#5.2     | #3 + #4                 | Combination score | Sampling          | Cosine similarity | Aligned  | 54.93
#6.1     | #5 + Data generation    | Combination score | Sampling          | Cosine similarity | Standard | 55.13
#6.2     | #5 + Data generation    | Combination score | Sampling          | Cosine similarity | Aligned  | 55.37

The baseline Scenario #1 achieved a 52.32% BLEU score. Scenario group #2 shows that the combination score gave the best results and that the sampling decoding method outperforms beam search.

Table 3: Effect of the BLEU filtering threshold in the data filtering using round-trip BLEU in Scenario group #3.

Scenario | Threshold | BLEU (%)
#3       | 10        | 54.02
#3       | 15        | 54.27
#3       | 20        | 54.16
#3       | 25        | 53.80

For Scenario group #3, we first evaluated the effect of the data filtering thresholds on the system's performance.
Tables 3 and 4 show that the BLEU score increases as the filtering threshold is raised, but only up to a certain point, after which it decreases. As the filtering thresholds increase, more low-quality parallel sentence pairs are filtered out of the synthetic bilingual dataset, but the size of this dataset also decreases. The best thresholds were then applied to all scenarios in group #3 when comparing the system performance with the other scenarios in Table 2.

Scenario group #4: First, for the standard version, we evaluated the model's performance with different augmented dataset sizes. The original bilingual dataset was combined with 30,000, 50,000, and 70,000 augmented sentence pairs created by data augmentation via the English pivot language, forming three training datasets. The obtained BLEU scores gradually increased from 52.48% to 52.52% to 52.98%, in proportion to the augmented data size. The best result, using 70,000 augmented sentence pairs, is compared with the other scenarios in Table 2 (Scenario #4.1). Scenario #4.2 also used 70,000 augmented sentence pairs, in the aligned version.

Table 4: Effect of the cosine filtering threshold in the data filtering using sentence-level cosine similarity in Scenario group #3.

Scenario | Threshold | BLEU (%)
#3       | 0.5       | 54.02
#3       | 0.6       | 54.36
#3       | 0.7       | 54.38
#3       | 0.8       | 53.92

With a BLEU score of 54.93%, Scenario group #5 shows the effectiveness of combining the best synthetic parallel datasets from Scenario #3 with 30,000 sentence pairs augmented in Scenario #4.

Finally, in Scenario group #6, we incorporated Scenario #5 with our data generation strategy to reach 55.37% BLEU, an improvement of 3.05% BLEU over the baseline model. The results show that generating the synthetic dataset from only the single highest-probability candidate was not enough; taking k candidates and evaluating them helped us retain more suitable candidates.

6.2 Comparison with other models

In addition to the scenario results above, we compared our best result with several models: Google Translator (via deep-translator, https://github.com/nidhaloff/deep-translator) and pre-trained multilingual seq2seq models, including mBART50 [37], m2m100-1.2B [41], and nllb-* [42], a multilingual translation model recently introduced by Facebook AI (https://ai.facebook.com/). The results in Table 5 indicate that our best model achieves the best results for translating from Khmer to Vietnamese. In addition, our current approach performs better than our previous model [18], with a 0.86% higher BLEU score.

Table 5: Comparison of our system results to other models.

Models                            | BLEU (%)
facebook/mbart50                  | 12.74
facebook/m2m100-1.2B              | 22.44
facebook/nllb-200-distilled-600M  | 32.48
facebook/nllb-200-distilled-1.3B  | 36.51
facebook/nllb-200-3.3B            | 37.81
Google Translator                 | 50.07
Our previous work [18]            | 54.51
Our best model                    | 55.37

7 Conclusions

This research presents an approach to address the low-resource challenge in Khmer-Vietnamese NMT. The proposed method uses the pretrained multilingual model mBART as the foundation of the MT system, complemented by various data augmentation strategies to enhance system performance. These augmentation strategies encompass back-translation, data augmentation through an English pivot language, and synthetic data generation.
The highest performance is achieved when combining the aforementioned augmentation methods with effective data selection and data filtering strategies, resulting in a significant 3.05% increase in BLEU score compared to the baseline model using mBART with the original dataset. Our proposed approach outperforms the Google Translator model by 5.3% BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs. Future work involves applying our proposed approach to other low-resource language pairs to demonstrate its generalizability.

References

[1] T. Khanna, J. N. Washington, et al. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation, Dec 2021. https://doi.org/10.1007/s10590-021-09260-6.
[2] P. Koehn, F. J. Och, et al. Statistical phrase-based translation. In Proceedings of NAACL, pages 48–54, 2003. https://doi.org/10.3115/1073445.1073462.
[3] P. Koehn, H. Hoang, et al. Moses: Open source toolkit for statistical machine translation. Pages 177–180. Association for Computational Linguistics, 2007. https://doi.org/10.3115/1557769.1557821.
[4] K. Cho, B. Merriënboer, et al. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of EMNLP, pages 103–111, 2014. https://doi.org/10.3115/v1/w14-4012.
[5] D. Suleiman, W. Etaiwi, and A. Awajan. Recurrent neural network techniques: Emphasis on use in neural machine translation. In Informatica, 2021. https://doi.org/10.31449/inf.v45i7.3743.
[6] Y. Tian, S. Khanna, and A. Pljonkin. Research on machine translation of deep neural network learning model based on ontology. In Informatica, 2021. https://doi.org/10.31449/inf.v45i5.3559.
[7] S. Edunov, M. Ott, et al. Understanding back-translation at scale. In Proceedings of EMNLP, pages 489–500, 2018. https://doi.org/10.18653/v1/d18-1045.
[8] A. Vaswani, N. Shazeer, et al. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://doi.org/10.48550/arXiv.1706.03762.
[9] Y. Chen, Y. Liu, et al. A teacher-student framework for zero-resource neural machine translation. In Proceedings of ACL (Volume 1: Long Papers), pages 1925–1935, 2017. https://doi.org/10.18653/v1/p17-1176.
[10] Y. Kim, P. Petrov, et al. Pivot-based transfer learning for neural machine translation between non-English languages. In Proceedings of EMNLP-IJCNLP, pages 866–876, 2019. https://doi.org/10.18653/v1/d19-1080.
[11] R. Sennrich, B. Haddow, et al. Improving neural machine translation models with monolingual data. In Proceedings of ACL (Volume 1: Long Papers), pages 86–96, 2016. https://doi.org/10.18653/v1/p16-1009.
[12] J. Zhang and C. Zong. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, pages 1535–1545, 2016. https://doi.org/10.18653/v1/d16-1160.
[13] M. Lewis, Y. Liu, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020. https://doi.org/10.18653/v1/2020.acl-main.703.
[14] C. Raffel, N. Shazeer, A. Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. https://doi.org/10.48550/arXiv.1910.10683.
[15] Y. Liu, J. Gu, N. Goyal, et al. Multilingual denoising pre-training for neural machine translation. Transactions of ACL, 8:726–742, 2020. https://doi.org/10.1162/tacl_a_00343.
[16] Van-Vinh Nguyen, Huong Le-Thanh, et al. KC4MT: A high-quality corpus for multilingual machine translation. In Proceedings of LREC, pages 5494–5502, 2022.
[17] N. H. Quan, N. T. Dat, N. H. M. Cong, et al. ViNMT: Neural machine translation toolkit, 2021. https://doi.org/10.48550/arXiv.2112.15272.
[18] V. H. Pham and T. H. Le. Improving Khmer-Vietnamese machine translation with data augmentation methods. In Proceedings of SoICT '22, pages 276–282, 2022. https://doi.org/10.1145/3568562.3568646.
[19] J. Devlin, M. Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL: Human Language Technologies, pages 4171–4186, 2019. https://doi.org/10.18653/v1/n19-1423.
[20] J. Zhu, Y. Xia, L. Wu, et al. Incorporating BERT into neural machine translation, 2020. https://openreview.net/forum?id=Hyl7ygStwB.
[21] S. Rothe, S. Narayan, et al. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of ACL, 8:264–280, 2020. https://doi.org/10.1162/tacl_a_00313.
[22] B. Zoph, D. Yuret, et al. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, pages 1568–1575, 2016. https://doi.org/10.18653/v1/d16-1163.
[23] J. Hu, L. Zhang, and D. Yu. Improved neural machine translation with paraphrase-based synthetic data. In Proceedings of NAACL, 2019.
[24] X. Niu et al. Subword-level word-interleaving data augmentation for neural machine translation. In Proceedings of EMNLP, 2018.
[25] Z. Liu et al. Word deletion data augmentation for low-resource neural machine translation. In Proceedings of ACL, 2021.
[26] H. Wang et al. Multi-objective data augmentation for low-resource neural machine translation. In Proceedings of IJCAI, 2019.
[27] C. Chu et al. Domain adaptation for neural machine translation with limited resources. In Proceedings of EMNLP, 2020.
[28] M. Johnson, M. Schuster, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of ACL, 5:339–351, 2017. https://doi.org/10.1162/tacl_a_00065.
[29] R. C. Moore and W. Lewis. Intelligent selection of language model training data. Pages 220–224. Proceedings of ACL, 2010. https://aclanthology.org/P10-2041.
[30] M. Wees, A. Bisazza, et al. Dynamic data selection for neural machine translation. In Proceedings of EMNLP, pages 1400–1410, 2017. https://doi.org/10.48550/arXiv.1708.00712.
[31] R. Wang, A. Finch, et al. Sentence embedding for neural machine translation domain adaptation. In Proceedings of ACL, pages 560–566, 2017. https://doi.org/10.18653/v1/p17-2089.
[32] S. Zhang and D. Xiong. Sentence weighting for neural machine translation domain adaptation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3181–3190, August 2018. https://aclanthology.org/C18-1269.
[33] C. C. Silva, C. Liu, et al. Extracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Translation, pages 224–231, 2018. https://doi.org/10.18653/v1/w18-6323.
[34] A. Poncelas et al. Data selection with feature decay algorithms using an approximated target side, 2018. https://doi.org/10.48550/arXiv.1811.03039.
[35] A. Imankulova, T. Sato, et al. Improving low-resource neural machine translation with filtered pseudo-parallel corpus. Pages 70–78. Asian Federation of Natural Language Processing, 2017. https://aclanthology.org/W17-5704.
[36] P. Koehn, H. Khayrallah, et al. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, pages 726–739, 2018. https://doi.org/10.18653/v1/w18-6453.
[37] Y. Tang, C. Tran, X. Li, et al. Multilingual translation with extensible multilingual pretraining and finetuning, 2020. https://doi.org/10.48550/arXiv.2008.00401.
[38] J. Cho, E. Jung, et al. Improving bi-encoder document ranking models with two rankers and multi-teacher distillation. In Proceedings of SIGIR '21, pages 2192–2196, 2021. https://doi.org/10.1145/3404835.3463076.
[39] N. Reimers and I. Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation, 2020. https://doi.org/10.48550/arXiv.2004.09813.
[40] K. Papineni, S. Roukos, et al. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318, 2002. https://doi.org/10.3115/1073083.1073135.
[41] A. Fan, S. Bhosale, H. Schwenk, et al. Beyond English-centric multilingual machine translation, 2020. https://doi.org/10.48550/arXiv.2010.11125.
[42] NLLB Team. No language left behind: Scaling human-centered machine translation, 2022. https://doi.org/10.48550/arXiv.2207.04672.

A Appendix 1

To assess the quality of the augmented data, we present exemplary outputs of the two methods described in Section 4 in Tables 6 and 7.

Table 6 shows examples generated by the back-translation method. The Vi-Km sentence pairs in the first and second columns are added to the augmented training dataset if they pass the synthetic data filtering step. The table reveals that the NMT models employed in the back-translation process may still produce semantically incorrect sentences, particularly when translating proper names. Such sentences are subsequently removed during the data filtering process. Notably, no sentence pairs added to the augmented dataset by this method exhibit poor quality.

In Table 7, we present examples of data augmentation via the English pivot language. Due to the relatively high quality of Google Translator, the augmented Vi-Km sentence pairs demonstrate a relatively high quality when the original Km-Vi sentence pair possesses good quality. However, discrepancies arise when the original Vietnamese-Khmer sentence pairs do not maintain complete semantic equivalence, leading to a similar mismatch between the newly generated Khmer sentence and the original Vietnamese one. Consequently, in such instances, the data filtering step excludes the new sentence pair from the augmented dataset.

Table 6: Output examples of the synthetic data generation process.
Example 1 (Action: Keep)
– Vi sentence (Vietnamese monolingual dataset): Việt Nam đã thâm nhập và mở rộng thương mại tại thị trường này. / Vietnam has penetrated and expanded its trade in this market.
– Km sentence generated by the Vi-Km model: េវៀតណាម បាន េ្រជ�ត ច ូ ល ន ិ ង ព្រង ី ក ពាណ ិ ជ្ជកម្ម ក្ន ុ ង ទ ី ផ�រ េនះ ។ / Vietnam has penetrated and expanded trade in this market.
– Vi sentence generated by the Km-Vi model: Việt Nam đã tham gia và mở rộng thương mại tại thị trường này. / Vietnam has joined and expanded trade in this market.

Example 2 (Action: Filter out)
– Vi sentence (Vietnamese monolingual dataset): Đoàn đại biểu kiều bào đã đến dâng hương ở tượng đài Vua Lê. / The overseas Vietnamese delegation came to offer incense at the statue of King Le.
– Km sentence generated by the Vi-Km model: គណៈ្របត ិ ភ ូ ម កព ី ្របេទស េវៀតណាម បាន មក ប ួ ងស ួ ង េន រ ូ បស ំ ណាក េស ្ត ច Li Lei ។ / A delegation from Vietnam came to pray at the statue of King Li Lei.
– Vi sentence generated by the Km-Vi model: Một phái đoàn từ Việt Nam đã đến thăm các khu vực của Hoàng gia Li Lei. / A delegation from Vietnam visited the areas of Royal Li Lei.

Example 3 (Action: Filter out)
– Vi sentence (Vietnamese monolingual dataset): Theo đó, các dụng cụ này dao động mức từ vài chục cho đến hàng chục triệu đồng. / Accordingly, these tools range from a few tens to tens of millions of dong.
– Km sentence generated by the Vi-Km model: តាមរយៈ េនះ ឧបករណ ៍ ទា ំ ងេនះ មាន តៃម្ល ព ី ម ួ យ ដង េទ ម ួ យ ដង េទ ម ួ យ ដង ។ / Through this, these devices are priced from time to time.
– Vi sentence generated by the Km-Vi model: Bằng cách này, những thiết bị này có giá trị một lần, một lần, một lần. / This way, these devices are worth it once, once, once.

Table 7: Output examples of the data augmentation process via the English pivot language.

Example 1 (Action: Keep)
– Original Km sentence: កញ ្ច ប ់ ទ ិ ន្នន ័ យ ្រត�វបាន តេ្រម�ប តាម ត ំ បន ់ ស្រមាប ់ អ្នកទ ិ ញ ងាយ�ស�ល េ្រជ ើ សេរ ើ ស ។ / The data packets are sorted by area for the buyer to easily select.
– Original Vi sentence: Các gói data được chia ra theo khu vực để người mua dễ dàng lựa chọn. / The data packages are divided by region for buyers to easily choose.
– Km augmented sentence: កញ ្ច ប ់ ទ ិ ន្នន ័ យ ្រត�វបាន ចាត ់ ថា ្ន ក ់ តាម ត ំ បន ់ េដ ើ ម្បី ងាយ�ស�ល េ្រជ ើ សេរ ើ ស អ្នកទ ិ ញ ។ / Data packages are categorized by area for easy selection of buyers.

Example 2 (Action: Filter out)
– Original Km sentence: មន ុ ស្ស ្របមាណ ៥០០ លាន នាក ់ អាច ្របឈម ន ឹ ង ភាព្រក ី ្រក េដាយសារ វ ិ បត្ត ិ េសដ្ឋក ិ ច្ច ដ ៏ ធ្ងន ់ ធ្ងរ ប ំ ផ ុ ត តា ំ ងព ី ម ុ ន មក ។ / An estimated 500 million people could face poverty due to the worst economic crisis ever.
– Original Vi sentence: Thế giới đang đối mặt với cuộc suy thoái kinh tế sâu sắc nhất, được đánh giá là nghiêm trọng hơn các cuộc khủng hoảng trước đây. / The world is facing the deepest economic recession, which is considered to be more severe than previous crises.
– Km augmented sentence: ព ិ ភពេលាក ក ំ ព ុ ង ្របឈមម ុ ខ ន ឹ ង វ ិ បត្ត ិ េសដ្ឋក ិ ច្ច ដ ៏ េ្រជ ប ំ ផ ុ ត ែដល ្រត�វបាន េគ ចាត ់ ទ ុ ក ថា ធ្ងន ់ ធ្ងរ ជាង វ ិ បត្ត ិ ម ុ ន ៗ / The world is facing the deepest economic recession, which is considered to be more severe than previous crises.