https://doi.org/10.31449/inf.v47i3.4761    Informatica 47 (2023) 349–360

Khmer-Vietnamese Neural Machine Translation Improvement Using Data Augmentation Strategies

Thai Nguyen Quoc 1, Huong Le Thanh 1,*, and Hanh Pham Van 2
1 School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam
2 FPT AI Center
E-mail: thai.nq212642m@sis.hust.edu.vn, huonglt@soict.hust.edu.vn, hanhpv@fsoft.com.vn

Keywords: machine translation, data augmentation, low-resource, Khmer-Vietnamese

Received: March 22, 2023

The development of neural models has greatly improved the performance of machine translation, but these methods require large-scale parallel data, which can be difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it on a small bilingual dataset. Additionally, two data augmentation strategies are proposed to generate new training data: (i) back-translation with a dataset in the target language; (ii) data augmentation via the English pivot language. The proposed approach is applied to Khmer-Vietnamese machine translation. Experimental results show that our proposed approach outperforms the Google Translator model by 5.3% in terms of BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.

Povzetek: The research uses a pre-trained multilingual model and data augmentation. The results surpass Google Translator by 5.3%.

1 Introduction

Machine translation (MT) is the task of automatically translating text from one language to another. There are three common approaches to MT: the rule-based approach [1], the statistical approach [2, 3], and the neural approach [4, 5, 6]. The rule-based approach depends on translation rules and dictionaries created by human experts. Statistical Machine Translation (SMT) relies on techniques such as word alignment and language modeling to optimize the translation process. While SMT can handle a wide range of languages and translation scenarios, it often struggles to capture complex linguistic phenomena and long-range dependencies. With significant advancements in deep learning, Neural Machine Translation (NMT) approaches have shown great potential and have replaced SMT as the primary approach to MT. NMT models capture contextual information, handle word reordering, and generate fluent and natural translations. NMT has gained popularity due to its end-to-end learning, its ability to handle complex linguistic phenomena, and its improved translation quality.

Among NMT systems, transformer-based MT models [7, 8] have demonstrated superior performance. The key feature of transformer models [8] is their attention mechanism, which allows them to effectively capture dependencies between different words in a sentence. Unlike traditional recurrent neural networks that process words sequentially, transformers can consider the entire input sentence simultaneously. This parallelization significantly speeds up training and makes transformers more effective for long-range dependencies.

One notable limitation of NMT techniques is their reliance on a substantial number of parallel sentence pairs for model training. Unfortunately, most language pairs in the world lack such large datasets.
Consequently, these language pairs fall under the category of low-resource MT, presenting a challenging scenario for the application of neural-based models.

Several works have addressed the low-resource problem in NMT. Chen et al. [9] and Kim et al. [10] dealt with low-resource NMT by using pivot translations, where one or more pivot languages are selected as a bridge between the source and target languages. The source-pivot and pivot-target pairs should be rich-resource language pairs. Sennrich et al. [11] and Zhang [12] applied forward/backward translation approaches to generate parallel sentence pairs by translating monolingual sentences into the target/source language via a translation system. The pseudo-parallel data is then mixed with the original parallel data to train an NMT model. A problem with this approach is how to control the quality of the pseudo-parallel dataset in order to improve the performance of the low-resource NMT system.

Since NMT requires the capability of both language understanding (the NMT encoder) and generation (the NMT decoder), pre-trained language models can be very helpful for NMT, especially low-resource NMT. To this end, the BART model [13] adds noise and randomly masks some tokens of the input sentences in the encoder, and learns to reconstruct the original text in the decoder. The T5 model [14] randomly masks some tokens and replaces consecutive masked tokens with a single sentinel token.

To address the low-resource problem in NMT, we propose to fine-tune mBART [15], a pretrained multilingual Bidirectional and Auto-Regressive Transformers model that has been specifically designed for multilingual applications, including MT. The fine-tuning process is combined with several strategies, including back-translation [11] and data augmentation via a pivot language. We propose several data augmentation strategies to augment the training data as well as to control the data quality.

Our proposed approach can be applied to any low-resource language pair. In this research, we evaluate it on the low-resource Khmer-Vietnamese (Km-Vi) language pair, using a dataset with 142,000 parallel sentence pairs from Nguyen et al. [16]. As far as we know, there are only two works dealing with Km-Vi machine translation ([17], [18]). Nguyen et al. [17] presented an open-source MT toolkit for low-resource language pairs. However, this approach only used a transformer architecture trained on the original dataset, without applying fine-tuning, transfer learning, or additional data augmentation techniques. Pham and Le [18] fine-tuned mBART and applied some data augmentation strategies. In this research, we extend the work in [18] to improve the performance of the Km-Vi NMT system. The contributions are as follows:

– We propose new methods for data selection based on sentence-level cosine similarity through a bi-encoder model [19] combined with the TF-IDF score.
– We suggest a data generation strategy to generate the best candidates for the synthetic parallel dataset.
– To control the quality of the augmented data, we propose an "aligned" version to enrich the data and a two-step filtering process to eliminate low-quality parallel sentence pairs.

The remainder of this paper is organized as follows.
Section 2 analyzes various techniques in existing research that address the limitations of low-resource NMT. Section 3 describes our system diagram. Our proposed data augmentation strategies are outlined in Section 4. Section 5 elaborates on the experimental design, whereas Section 6 presents an analysis of the empirical outcomes. Finally, Section 7 concludes the paper.

2 Related work

Pretrained Language Models (PLMs) have proven to be helpful instruments in the context of low-resource NMT. The literature has shown that low-resource NMT models can benefit from the use of a single PLM [20, 21] or a multilingual one [13]. A multilingual PLM is claimed to facilitate more effective learning of the connection between the source and target representations for translation. These transfer learning methods leverage rich-resource language pairs to train the system, then fine-tune all parameters on the specific target language pair [22]. The rich-resource language pairs should belong to a language family similar to the low-resource ones in order to obtain good results.

Data augmentation is the method of generating additional data, achieved by expanding the original dataset or integrating supplementary data from relevant sources. Various approaches to data augmentation have been explored, including: (i) paraphrasing and sentence simplification [23], (ii) word substitution and deletion [24, 25], (iii) limited and constrained word order permutation [26], (iv) domain adaptation [27], (v) back-translation [11], and (vi) data augmentation via a pivot language [28].

Paraphrasing and sentence simplification [23] offer varied quality, with a risk of introducing semantic changes or losing important information. Word substitution [24] requires careful selection of synonyms to maintain accuracy, while word deletion [25] can introduce noise and requires effective training to handle missing information. Limited and constrained word order permutation [26] suits language pairs with word order variations but requires defining complex constraints based on language characteristics. Domain adaptation [27] addresses the challenge of domain-specific low-resource machine translation, which is not the target of this research. Back-translation [11] has proven successful by generating synthetic source sentences through translating target sentences. However, this approach carries a risk of errors due to imperfections in pre-trained translation models. On the other hand, the pivot-based approach [28] translates low-resource language pairs through a high-resource language. This approach relies on good translation quality to and from the pivot language.

Back-translation and pivot-based translation are considered reliable and generalizable approaches when complemented by effective post-processing methods for filtering low-quality data. Therefore, this paper specifically concentrates on back-translation and pivot-based translation as the selected methods for data augmentation. To improve the quality of the synthetic parallel data generated by these methods, two strategies are employed: (i) data selection and (ii) synthetic data filtering.

Data selection is the process of ranking and selecting a subset of a target monolingual dataset that is in the same domain as the training data. The objective of this process is to improve the performance of an NMT system for a particular domain.
Various techniques for data selection have been proposed in the literature, such as computing sentence scores based on Cross-Entropy Difference (CED) [29, 30] and using representation vectors to rank sentences in the monolingual dataset [31, 32]. Three data selection methods were implemented by Silva et al. [33], namely CED, TF-IDF, and Feature Decay Algorithms (FDA) [34]. The experimental results showed that the TF-IDF method gained the best improvements in both BLEU and TER (Translation Error Rate) scores.

Synthetic data filtering. To filter out low-quality sentence pairs, Imankulova et al. [35] proposed a method based on the BLEU measure. This method leverages a source-to-target NMT model to translate the synthetic source sentences into synthetic target sentences. Subsequently, the sentence-level BLEU score is calculated between each synthetic target sentence and the corresponding target sentence, with the objective of excluding low-score sentences. Koehn et al. [36] proposed another approach based on the sentence-level cosine similarity of two sentences. However, their proposal requires an effective acquisition of the linear mapping relationship between the embedding spaces of the source language and the target one.

Another way to improve translation quality is to use data augmentation via a pivot language [28]. This method translates sentences from the source language to the pivot language using a source-pivot translation model, and then translates the pivot-language sentences into the target language. However, there are certain restrictions associated with this technique. Firstly, the circular translation process increases the decoding time during inference, as it can iterate through multiple languages to obtain the desired quality. Secondly, translation errors may arise in each step, which can lead to low-quality translations in the target language.

In this paper, we introduce an approach aimed at enhancing the performance of low-resource MT. Our approach incorporates multiple data augmentation strategies alongside various data filtering methods to improve the quality of synthetic data. The subsequent sections introduce these methods in detail.

3 Our system diagram

As previously mentioned, our goal is to propose strategies that can improve the performance of low-resource NMT systems. The proposed approach is applied to the Km-Vi language pair. To do this, we first fine-tune the mBART50 [37] model with the Km-Vi bilingual dataset.

The mBART model. Multilingual BART (mBART) [15] is a sequence-to-sequence denoising auto-encoder that was pre-trained on large-scale monolingual corpora in many languages using the BART objective [13]. The pre-training task is to reconstruct the original text from a noisy version, using two types of noise: random span masking and order permutation. A variant of mBART called mBART50 [37] has been trained on 50 languages, including Khmer and Vietnamese. Nonetheless, mBART50's translation quality for the Km-Vi language pair is low. To deal with this problem, we propose to fine-tune mBART50 with the Km-Vi bilingual dataset combined with the dataset augmented through several strategies.
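As a concrete illustration of this setup, the sketch below fine-tunes an mBART50 checkpoint for the Khmer-to-Vietnamese direction with the HuggingFace transformers library. The checkpoint name, the toy in-memory dataset, and most hyperparameters are illustrative assumptions rather than the exact configuration used in this work (only the learning rate matches the value reported in Section 5.1).

```python
# Minimal sketch: fine-tuning mBART50 for Khmer -> Vietnamese.
# Assumptions: the public checkpoint "facebook/mbart-large-50-many-to-many-mmt"
# and a toy in-memory dataset stand in for the actual data pipeline.
import torch
from transformers import (MBart50TokenizerFast, MBartForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"   # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    checkpoint, src_lang="km_KH", tgt_lang="vi_VN")
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# Toy parallel pairs standing in for the 140,000-pair Km-Vi training set.
pairs = [("<Khmer sentence>", "<Vietnamese sentence>")]

def encode(pair):
    km, vi = pair
    # Tokenize source and target; for brevity, padded label tokens are not
    # masked to -100 here as one would do in a production setup.
    return dict(tokenizer(km, text_target=vi, max_length=128,
                          truncation=True, padding="max_length"))

train_data = [encode(p) for p in pairs]

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-km-vi",
    per_device_train_batch_size=8,
    learning_rate=3e-5,               # value reported in Section 5.1
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```

The same recipe, with source and target languages swapped, yields the target-to-source (Vi-Km) model used for back-translation in Section 4.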
Our proposed Khmer-Vietnamese MT system is described in Figure 1. It incorporates two strategies for data augmentation: (i) back-translation with a dataset in the target language; and (ii) data augmentation via the English pivot language. These strategies are introduced in the next section.

Figure 1: Our proposed Khmer-Vietnamese MT system diagram.

4 Data augmentation strategies

Since word order and word meaning are important in machine translation, methods such as paraphrasing, simplification, and limited and constrained word order permutation cannot provide good parallel sentence pairs.

4.1 Back-translation with a dataset in the target language

The back-translation method proposed by Sennrich et al. [11] is a useful way to generate additional training data for low-resource NMT. This method leverages an external dataset in the target language, termed the "target-language dataset". It employs a target-to-source NMT model, trained on the original bilingual dataset, to translate this dataset into the source language. The resulting translated sentences are then combined with their corresponding target sentences, creating a synthetic bilingual dataset. However, the quality of the dataset generated by this method is not guaranteed. To address this issue, we improve the method by integrating data filtering techniques into the back-translation process. Our proposed method is conducted in three steps as follows:

– Step 1 - Data selection: Rank and select sentences from a target-language dataset that are in the same domain as the sentences in the original bilingual dataset.
– Step 2 - Data generation: Each sentence from the output dataset of Step 1 is translated into k sentential candidates in the source language using the target-to-source NMT model, which has been trained by fine-tuning mBART50 with the original bilingual dataset.
– Step 3 - Data filtering: Filter out low-quality bilingual sentence pairs in the synthetic parallel dataset.

We discuss these three steps in the following sections.

4.1.1 Data selection

For a given dataset D consisting of T_D sentence pairs in a specific domain and a set of sentences G in a general domain, the aim of data selection is to rank the sentences in G based on their similarity to the domain of D, and then select the highest-ranked sentences to form a subset that shares the same domain as D. Given that TF-IDF is a popular technique for identifying representative words of a dataset, we can assess whether sentences in G belong to the same domain as D using this measure. In addition to the TF-IDF measure, cosine similarity can be employed to measure the semantic similarity between two sentences based on their semantic vector representations. This enables the identification of sentences in G that share the same domain as the sentences in D. For this reason, TF-IDF, cosine similarity, and their combination are used for ranking.

Data selection based on TF-IDF score. The term frequency (TF) measures the frequency of a term (word or subword) in a sentence, while the inverse document frequency (IDF) reflects the proportion of sentences in the in-domain corpus that contain the term. The TF-IDF score of a word w in a sentence s in G is calculated as:

$$\mathrm{score}_w = \mathrm{TF\text{-}IDF}_w = \frac{F_w^G}{W_s^G} \cdot \frac{T_D}{K_w^D}$$

where $F_w^G$ is the frequency of $w$ in $s$, $W_s^G$ is the number of words in $s$, $T_D$ is the number of sentences in $D$, and $K_w^D$ is the number of sentences in $D$ that contain $w$. The TF-IDF score of a sentence $s \in G$ is then the sum of its word scores:

$$\mathrm{score}_s^{(\mathrm{TF\text{-}IDF})} = \sum_{i=1}^{W_s^G} \mathrm{score}_{w_i}$$
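The sketch below illustrates this TF-IDF ranking on a toy corpus. The whitespace tokenisation and the two tiny corpora are assumptions made for illustration only; the actual system scores the crawled Vietnamese corpus described in Section 5.

```python
# Minimal sketch of the TF-IDF sentence score defined above.
# Assumptions: whitespace tokenisation; D and G are toy corpora.
from collections import Counter

D = ["giá vàng hôm nay tăng mạnh",                  # in-domain sentences (toy)
     "thị trường chứng khoán giảm điểm"]
G = ["giá vàng tăng nhẹ trong phiên sáng",          # general-domain sentences (toy)
     "trận đấu kết thúc với tỉ số hòa"]

T_D = len(D)
# K_w^D: number of sentences in D containing word w
K = Counter()
for sent in D:
    for w in set(sent.split()):
        K[w] += 1

def sentence_tfidf_score(s):
    words = s.split()
    W_s = len(words)
    freq = Counter(words)
    score = 0.0
    for w in words:                    # sum over the W_s^G word tokens of s
        if K[w] == 0:                  # a word never seen in D contributes nothing
            continue
        tf = freq[w] / W_s             # F_w^G / W_s^G
        idf = T_D / K[w]               # T_D / K_w^D
        score += tf * idf
    return score

# Rank G by similarity to the domain of D and keep the highest-ranked sentences.
ranked = sorted(G, key=sentence_tfidf_score, reverse=True)
print(ranked[0])
```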
Data selection based on cosine similarity score. The cosine similarity score between two sentences is calculated using a Bi-Encoder model [38]. This model combines a PLM with a pooler layer to encode each sentence as a sentence-level representation vector; the cosine similarity between two such vectors is then computed. To choose the optimal PLM for Vietnamese (the target language), we built a test set for the masked language model task, consisting of 140,000 Vietnamese sentences from the Km-Vi bilingual dataset. Based on the accuracy of several well-known PLMs on this test set (Table 1), namely PhoBERT (https://huggingface.co/vinai/phobert-base), XLM-RoBERTa (https://huggingface.co/xlm-roberta-base), and mDeBERTa (https://huggingface.co/microsoft/mdeberta-v3-base), XLM-RoBERTa is selected as the PLM for the Bi-Encoder model.

Table 1: Accuracy of some models on the test set for the masked language model task.

Model        | Accuracy
PhoBERT      | 80%
XLM-RoBERTa  | 87%
mDeBERTa     | 75%

The cosine similarity score of a sentence $s \in G$ is calculated as:

$$\mathrm{score}_s^{(\mathrm{COS})} = \frac{1}{|D|} \sum_{i=1}^{|D|} \cos(s, D_i)$$

where $|D|$ is the number of sentences in $D$ and $D_i$ is the $i$-th sentence in $D$.

Data selection based on combination score. The combination score is calculated from the TF-IDF score and the cosine similarity score:

$$\mathrm{score}_s = \frac{\mathrm{score}_s^{\mathrm{TF\text{-}IDF}}}{\sum_{j=1}^{|G|} \mathrm{score}_{G_j}^{\mathrm{TF\text{-}IDF}}} + \frac{\mathrm{score}_s^{\mathrm{COS}}}{\sum_{j=1}^{|G|} \mathrm{score}_{G_j}^{\mathrm{COS}}}$$

where $|G|$ is the number of sentences in $G$ and $G_j$ is the $j$-th sentence in $G$.

After assigning these scores to the sentences in the corpus G, the top 120,000 sentences of the target-language dataset with the highest scores are selected and translated into the source language by the target-to-source translation model.

4.1.2 Synthetic data generation

To increase the number of generated sentence pairs, each sentence from the target-language dataset is translated into k candidate sentences in the source language using beam search (with beam size k) or top-k sampling. As a result, k bilingual sentence pairs are created for each sentence in the target-language dataset. At this step, the synthetic dataset size can increase significantly; however, the dataset may contain many low-quality candidates. The next section presents our method for filtering out these low-quality candidates.
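The following sketch illustrates this generation step for a single selected Vietnamese sentence, producing k Khmer candidates with either top-k sampling or beam search. The local checkpoint path is hypothetical, standing in for the fine-tuned target-to-source (Vi-Km) model; k and the decoding settings are illustrative.

```python
# Minimal sketch of Step 2 (data generation): one Vietnamese sentence is
# translated into k Khmer candidates with the fine-tuned Vi->Km model.
# "mbart50-vi-km" is a hypothetical local path to that checkpoint.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = "mbart50-vi-km"                                # hypothetical fine-tuned model
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt, src_lang="vi_VN", tgt_lang="km_KH")
model = MBartForConditionalGeneration.from_pretrained(ckpt)

vi_sentence = "Giá vàng hôm nay tăng mạnh."
inputs = tokenizer(vi_sentence, return_tensors="pt")
k = 4

# Top-k sampling (the decoding strategy that performs best in Section 6) ...
sampled = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],
    do_sample=True, top_k=50, num_return_sequences=k, max_new_tokens=128)
# ... or, alternatively, beam search with k beams.
beamed = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],
    num_beams=k, num_return_sequences=k, max_new_tokens=128)

km_candidates = tokenizer.batch_decode(sampled, skip_special_tokens=True)
# Each (km_candidate, vi_sentence) pair is passed to the filtering step below.
```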
4.1.3 Synthetic data filtering

Our data filtering approach is based on sentence-level cosine similarity. It compares the similarity between the original sentence and its corresponding back-translated sentence, enabling us to identify and eliminate sentence pairs that deviate significantly from the original meaning. Our method distinguishes itself from Koehn's approach [36] by not requiring an effective acquisition of the linear mapping relationship between the embedding spaces of the source and target languages. Instead, we leverage a cosine similarity measure to assess the semantic similarity between sentences.

Data filtering based on cosine similarity. An important aspect of this approach is how sentences in different languages are represented. Although multilingual LMs (e.g., XLM-RoBERTa) can produce such representations, their out-of-the-box sentence representations are rather poor. Moreover, the vector spaces of different languages are not aligned, meaning that words or sentences with the same meaning in different languages are represented by different vectors. Reimers and Gurevych [39] proposed a straightforward technique to ensure consistent vector spaces across different languages. This method uses a PLM as a fixed Teacher model that produces good representation vectors of sentences. The Student model is designed to imitate the Teacher model: the same sentence should be represented by the same vector in both the Teacher and the Student model. To enable the Student model to work with additional languages, it is trained on parallel (translated) sentences, where the translation of each sentence should also be mapped to the same vector as the original one. In Figure 2, the Student model should map "Hello World" and the German translation "Hallo Welt" to the vector of Teacher("Hello World"). This is achieved by training the Student model with the mean squared error (MSE) loss.

Figure 2: Given parallel data (e.g., English and German), the Student model is trained so that the produced vectors for the English and German sentences are close to the Teacher's English sentence vector [39].

Based on this approach, we first generate two bilingual datasets, Vietnamese-English and Khmer-English parallel sentence pairs, from the original Km-Vi dataset using the Google Translator API provided by the deep-translator library (https://github.com/nidhaloff/deep-translator). The Student model is then trained on both the Vietnamese-English dataset and the Khmer-English one to create semantic vectors for three languages: English, Vietnamese, and Khmer. The representation vector of a sentence is the average of its token embeddings produced by the Student model. We calculate the sentence-level cosine similarity of each pair in the synthetic parallel dataset and filter out pairs with low scores.

Data filtering using round-trip BLEU. The diagram of this method is presented in Figure 3. The process begins with the training of two NMT models, Km-Vi (source-to-target) and Vi-Km (target-to-source), on the given parallel sentence pairs. Next, we use the Vi-Km translation model to translate the monolingual Vietnamese sentences into Khmer. We then back-translate the translated sentences using the Km-Vi model. We evaluate the quality of the sentence pairs based on sentence-level BLEU scores and discard sentence pairs with low scores.

Figure 3: The diagram of the data filtering using round-trip BLEU.
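A minimal sketch of the two filtering criteria is given below. It assumes a publicly available multilingual sentence encoder as a stand-in for the distilled Student model described above (this work trains its own), and it uses the threshold values that turn out best in Tables 3 and 4; whether the stand-in encoder represents Khmer well enough is an open assumption.

```python
# Minimal sketch of the two filtering criteria (Step 3), under the assumptions
# stated in the lead-in: a public multilingual encoder replaces the distilled
# Student model, and the thresholds are the best values from Tables 3-4.
import sacrebleu
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # stand-in encoder

def cosine_ok(vi_sentence, km_synthetic, threshold=0.7):
    """Keep the pair if the cross-lingual cosine similarity is high enough."""
    emb = encoder.encode([vi_sentence, km_synthetic], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def round_trip_bleu_ok(vi_original, vi_round_trip, threshold=15.0):
    """Keep the pair if the round-trip translation still matches the original."""
    return sacrebleu.sentence_bleu(vi_round_trip, [vi_original]).score >= threshold

vi = "Giá vàng hôm nay tăng mạnh."
km_synth = "<Khmer back-translation of the sentence>"
vi_round_trip = "<Vietnamese obtained by translating km_synth back with the Km-Vi model>"

keep_cos = cosine_ok(vi, km_synth)                   # proposed filter (Scenario #3.2)
keep_bleu = round_trip_bleu_ok(vi, vi_round_trip)    # round-trip BLEU filter (Scenario #3.1)
```

In the pipeline, pairs failing the chosen criterion are simply dropped before the synthetic data is merged with the original bilingual corpus.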
4.2 Data augmentation method via the English pivot language

A standard data augmentation method via the English pivot language translates the sentences in the target language from the original source-target parallel pairs into English. These English sentences are then translated into the source language to generate source-target augmentation bilingual sentence pairs.

We propose an "aligned" version to improve the quality of the augmented dataset. Given an original source-target sentence pair with a source sentence w_s and a target sentence w_t, we generate additional candidate sentences in the following way. The target sentence w_t is translated into the source language via the English pivot language, producing a candidate sentence w_c1 in the source language. The target-to-source translation model described in Section 4.1 is used to generate another candidate sentence w_c2 in the source language. The candidate pairs w_c1 and w_c2 are aggregated into a temporary dataset. We then carry out two filtering steps to remove low-quality parallel sentence pairs: (i) aligning parallel sentence pairs and (ii) data filtering.

In the first step, the temporary dataset is aligned by three tools: Vecalign (https://github.com/thompsonb/vecalign), Bleualign (https://github.com/rsennrich/Bleualign), and Hunalign (https://github.com/danielvarga/hunalign). Vecalign utilizes word embeddings to align sentences based on semantic similarity. Bleualign, on the other hand, uses the BLEU metric and n-gram overlap to align sentences in bilingual corpora. Hunalign is a heuristic-based tool that aligns parallel texts based on sentence length and lexical similarity. Sentence pairs that are aligned by at least two of the three tools are selected to form an aligned dataset. In the second step, the aligned dataset is filtered using the data filtering method of Section 4.1.3. As a result, we obtain an augmented dataset, which is combined with the synthetic parallel dataset from Section 4.1 and the original bilingual dataset to form the final training dataset.

5 Experiments

5.1 Experiment setup

We fine-tuned the mBART50 model on an RTX 3090 (24 GB) GPU with different hyperparameters to choose the optimal parameter set for the model, as follows: Adam optimization (learning_rate = 3e-5, β1 = 0.9, β2 = 0.999, and ε = 1e-8) with linear learning rate decay scheduling. The best set of hyperparameters is employed in all our experiments.

To evaluate the effectiveness of our experiments, we used the BLEU score [40] computed with sacreBLEU (https://github.com/mjpost/sacrebleu). A higher BLEU score indicates better translation quality.

5.2 Experimental scenarios

To evaluate the effectiveness of our proposed methods for low-resource NMT, we used the Km-Vi bilingual dataset from Nguyen et al. [16]. This dataset consists of 142,000 parallel sentence pairs, divided into a training set of 140,000 sentence pairs and a test set of 2,000 pairs. To prevent bias in the experiments, Nguyen et al. [16] randomly selected the 2,000 test sentence pairs from the original bilingual dataset, following the distribution of domains and sentence lengths. Six scenario groups were carried out in our experiments.

Scenario group #1 - Baseline model: Fine-tune the mBART50 model on the original Km-Vi bilingual dataset (Scenario #1).

All scenario groups from #2 to #6 used additional bilingual datasets generated from the Vietnamese corpus or from the original Km-Vi bilingual dataset. These datasets were combined with the original dataset to create a larger training corpus. The Vietnamese dataset was created by crawling online news websites (i.e., https://vnexpress.net and https://dantri.com.vn), then preprocessing to remove noise and overly long sentences. The langdetect library (https://pypi.org/project/langdetect) was used to filter out non-Vietnamese sentences.

Scenario group #2 (#2.1 to #2.6) - Combine Scenario #1 and back-translation: To generate a synthetic parallel dataset, 120,000 sentences from the above-mentioned Vietnamese dataset were selected using our data selection strategies. These sentences were then translated into Khmer using our back-translation method.
We implemented and compared four data selection methods and two decoding methods (i.e., sampling and beam search).

Scenario group #3 (#3.1 to #3.3) - Combine Scenario #2 and data filtering: In this scenario, we compared two methods in the data filtering strategy: round-trip BLEU [35] (#3.1) and our proposed sentence-level cosine similarity (#3.2). We experimented with two types of data selection: TF-IDF (#3.1 and #3.2) and the combination score (#3.3).

Scenario group #4 (#4.1 to #4.2) - Combine Scenario #1 and data augmentation via the English pivot language: We compared the "standard" and "aligned" versions for generating an augmented dataset. The Google Translator API is used for the translation task.

Scenario group #5 (#5.1 to #5.2) - Combine Scenarios #3 and #4: We created a new training dataset from the best settings of Scenarios #3 and #4.

Scenario group #6 (#6.1 to #6.2) - Combine Scenario #5 and data generation: In this experiment, at the back-translation step, each sentence from the Vietnamese dataset was translated into k corresponding Khmer candidate sentences. These sentences were then filtered and combined with the original bilingual dataset to create a new training dataset.

6 Experimental results

This section presents a comprehensive evaluation of our system's performance under various scenarios and compares the best results with other relevant research. The analysis of the augmented data's quality is provided in Appendix 1.

6.1 Analysis of our system's performance using different scenarios

We evaluated our different scenarios on a test set with 2,000 parallel sentence pairs. The results are presented in Table 2.

Table 2: Experimental results.

Scenario | Name                    | Data Selection    | Decoding Strategy | Data Filtering    | Via English pivot language | BLEU (%)
#1       | Baseline model          | -                 | -                 | -                 | -        | 52.32
#2.1     | #1 + Back-translation   | Randomness        | Beam search       | -                 | -        | 53.16
#2.2     | #1 + Back-translation   | Randomness        | Sampling          | -                 | -        | 53.49
#2.3     | #1 + Back-translation   | TF-IDF            | Beam search       | -                 | -        | 53.83
#2.4     | #1 + Back-translation   | TF-IDF            | Sampling          | -                 | -        | 53.96
#2.5     | #1 + Back-translation   | Cosine similarity | Sampling          | -                 | -        | 53.98
#2.6     | #1 + Back-translation   | Combination score | Sampling          | -                 | -        | 54.08
#3.1     | #2 + Data filtering     | TF-IDF            | Sampling          | Round-trip BLEU   | -        | 54.27
#3.2     | #2 + Data filtering     | TF-IDF            | Sampling          | Cosine similarity | -        | 54.38
#3.3     | #2 + Data filtering     | Combination score | Sampling          | Cosine similarity | -        | 54.48
#4.1     | #1 + Data augmentation  | -                 | -                 | -                 | Standard | 52.98
#4.2     | #1 + Data augmentation  | -                 | -                 | -                 | Aligned  | 53.29
#5.1     | #3 + #4                 | TF-IDF            | Sampling          | Cosine similarity | Standard | 54.51
#5.2     | #3 + #4                 | Combination score | Sampling          | Cosine similarity | Aligned  | 54.93
#6.1     | #5 + Data generation    | Combination score | Sampling          | Cosine similarity | Standard | 55.13
#6.2     | #5 + Data generation    | Combination score | Sampling          | Cosine similarity | Aligned  | 55.37

The baseline Scenario #1 achieved a 52.32% BLEU score. Scenario group #2 shows that the combination score gave the best results and that the sampling decoding method outperforms beam search.

Table 3: Effect of the BLEU filtering threshold in the data filtering using round-trip BLEU in Scenario group #3.

Scenario | Threshold | BLEU (%)
#3       | 10        | 54.02
#3       | 15        | 54.27
#3       | 20        | 54.16
#3       | 25        | 53.80

For Scenario group #3, we first evaluated the effect of the data filtering thresholds on the system's performance.
Tables 3 and 4 show that the BLEU score increases as the filtering threshold is raised, but only up to a certain point, after which it decreases. As the filtering thresholds increase, more low-quality parallel sentence pairs are filtered out of the synthetic bilingual dataset, but the size of this dataset also decreases. The best thresholds were then applied to all scenarios in group #3 when comparing the system performance with the other scenarios in Table 2.

Scenario group #4: First, for the standard version, we evaluated the model's performance with different augmented dataset sizes. The original bilingual dataset was combined with 30,000, 50,000, and 70,000 augmented sentence pairs created by data augmentation via the English pivot language, forming three training datasets. The obtained BLEU scores gradually increased from 52.48% to 52.52% to 52.98%, in proportion to the augmented data size. The best result, using 70,000 augmented sentence pairs, is compared with the other scenarios in Table 2 (Scenario #4.1). Scenario #4.2 also used 70,000 augmented sentence pairs, in the aligned version.

Table 4: Effect of the cosine filtering threshold in the data filtering using sentence-level cosine similarity in Scenario group #3.

Scenario | Threshold | BLEU (%)
#3       | 0.5       | 54.02
#3       | 0.6       | 54.36
#3       | 0.7       | 54.38
#3       | 0.8       | 53.92

With a BLEU score of 54.93%, Scenario group #5 shows the effectiveness of combining the best synthetic parallel datasets from Scenario #3 with 30,000 sentence pairs augmented in Scenario #4.

Finally, in Scenario group #6, we incorporated Scenario #5 with our data generation strategy to reach 55.37% BLEU, an improvement of 3.05% BLEU over the baseline model. The results show that generating the synthetic dataset from only the single highest-probability candidate was not enough; taking k candidates and evaluating them helped us retain more suitable candidates.

6.2 Comparison with other models

In addition to the scenario results above, we compared our best result with several models: Google Translator (via deep-translator, https://github.com/nidhaloff/deep-translator) and pre-trained multilingual seq2seq models, including mBART50 [37], m2m100-1.2B [41], and nllb-* [42], a multilingual translation model recently introduced by Facebook AI (https://ai.facebook.com/). The results in Table 5 indicate that our best model achieves the best results for translating from Khmer to Vietnamese. In addition, our current approach performs better than our previous model [18], with a 0.86% higher BLEU score.

Table 5: Comparison of our system results to other models.

Models                            | BLEU (%)
facebook/mbart50                  | 12.74
facebook/m2m100-1.2B              | 22.44
facebook/nllb-200-distilled-600M  | 32.48
facebook/nllb-200-distilled-1.3B  | 36.51
facebook/nllb-200-3.3B            | 37.81
Google Translator                 | 50.07
Our previous work [18]            | 54.51
Our best model                    | 55.37

7 Conclusions

This research presents an approach to address the low-resource challenge in Khmer-Vietnamese NMT. The proposed method uses the pretrained multilingual model mBART as the foundation of the MT system, complemented by various data augmentation strategies to enhance system performance. These augmentation strategies encompass back-translation, data augmentation through an English pivot language, and synthetic data generation.
The highest performance is achieved when combining the aforementioned augmentation methods with effective data selection and data filtering strategies, resulting in a significant 3.05% increase in BLEU score compared to the baseline model using mBART with the original dataset. Our proposed approach outperforms the Google Translator model by 5.3% BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs. Future work involves applying our proposed approach to other low-resource language pairs to demonstrate its generalizability.

References

[1] T. Khanna, J. N. Washington, et al. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation, Dec 2021. https://doi.org/10.1007/s10590-021-09260-6.
[2] P. Koehn, F. J. Och, et al. Statistical phrase-based translation. In Proceedings of NAACL, pages 48–54, 2003. https://doi.org/10.3115/1073445.1073462.
[3] P. Koehn, H. Hoang, et al. Moses: Open source toolkit for statistical machine translation. Pages 177–180. Association for Computational Linguistics, 2007. https://doi.org/10.3115/1557769.1557821.
[4] K. Cho, B. Merriënboer, et al. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of EMNLP, pages 103–111, 2014. https://doi.org/10.3115/v1/w14-4012.
[5] D. Suleiman, W. Etaiwi, and A. Awajan. Recurrent neural network techniques: Emphasis on use in neural machine translation. In Informatica, 2021. https://doi.org/10.31449/inf.v45i7.3743.
[6] Y. Tian, S. Khanna, and A. Pljonkin. Research on machine translation of deep neural network learning model based on ontology. In Informatica, 2021. https://doi.org/10.31449/inf.v45i5.3559.
[7] S. Edunov, M. Ott, et al. Understanding back-translation at scale. In Proceedings of EMNLP, pages 489–500, 2018. https://doi.org/10.18653/v1/d18-1045.
[8] A. Vaswani, N. Shazeer, et al. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://doi.org/10.48550/arXiv.1706.03762.
[9] Y. Chen, Y. Liu, et al. A teacher-student framework for zero-resource neural machine translation. In Proceedings of ACL (Volume 1: Long Papers), pages 1925–1935, 2017. https://doi.org/10.18653/v1/p17-1176.
[10] Y. Kim, P. Petrov, et al. Pivot-based transfer learning for neural machine translation between non-English languages. In Proceedings of EMNLP-IJCNLP, pages 866–876, 2019. https://doi.org/10.18653/v1/d19-1080.
[11] R. Sennrich, B. Haddow, et al. Improving neural machine translation models with monolingual data. In Proceedings of ACL (Volume 1: Long Papers), pages 86–96, 2016. https://doi.org/10.18653/v1/p16-1009.
[12] J. Zhang and C. Zong. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, pages 1535–1545, 2016. https://doi.org/10.18653/v1/d16-1160.
[13] M. Lewis, Y. Liu, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020. https://doi.org/10.18653/v1/2020.acl-main.703.
[14] C. Raffel, N. Shazeer, A. Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. https://doi.org/10.48550/arXiv.1910.10683.
[15] Y. Liu, J. Gu, N. Goyal, et al. Multilingual denoising pre-training for neural machine translation. Transactions of ACL, 8:726–742, 2020. https://doi.org/10.1162/tacl_a_00343.
[16] Van-Vinh Nguyen, Huong Le-Thanh, et al. KC4MT: A high-quality corpus for multilingual machine translation. In Proceedings of LREC, pages 5494–5502, 2022.
[17] N. H. Quan, N. T. Dat, N. H. M. Cong, et al. ViNMT: Neural machine translation toolkit, 2021. https://doi.org/10.48550/arXiv.2112.15272.
[18] V. H. Pham and T. H. Le. Improving Khmer-Vietnamese machine translation with data augmentation methods. In Proceedings of SoICT '22, pages 276–282, 2022. https://doi.org/10.1145/3568562.3568646.
[19] J. Devlin, M. Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL: Human Language Technologies, pages 4171–4186, 2019. https://doi.org/10.18653/v1/n19-1423.
[20] J. Zhu, Y. Xia, L. Wu, et al. Incorporating BERT into neural machine translation, 2020. https://openreview.net/forum?id=Hyl7ygStwB.
[21] S. Rothe, S. Narayan, et al. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of ACL, 8:264–280, 2020. https://doi.org/10.1162/tacl_a_00313.
[22] B. Zoph, D. Yuret, et al. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, pages 1568–1575, 2016. https://doi.org/10.18653/v1/d16-1163.
[23] J. Hu, L. Zhang, and D. Yu. Improved neural machine translation with paraphrase-based synthetic data. In Proceedings of NAACL, 2019.
[24] X. Niu et al. Subword-level word-interleaving data augmentation for neural machine translation. In Proceedings of EMNLP, 2018.
[25] Z. Liu et al. Word deletion data augmentation for low-resource neural machine translation. In Proceedings of ACL, 2021.
[26] H. Wang et al. Multi-objective data augmentation for low-resource neural machine translation. In Proceedings of IJCAI, 2019.
[27] C. Chu et al. Domain adaptation for neural machine translation with limited resources. In Proceedings of EMNLP, 2020.
[28] M. Johnson, M. Schuster, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of ACL, 5:339–351, 2017. https://doi.org/10.1162/tacl_a_00065.
[29] R. C. Moore and W. Lewis. Intelligent selection of language model training data. Pages 220–224. Proceedings of ACL, 2010. https://aclanthology.org/P10-2041.
[30] M. Wees, A. Bisazza, et al. Dynamic data selection for neural machine translation. In Proceedings of EMNLP, pages 1400–1410, 2017. https://doi.org/10.48550/arXiv.1708.00712.
[31] R. Wang, A. Finch, et al. Sentence embedding for neural machine translation domain adaptation. In Proceedings of ACL, pages 560–566, 2017. https://doi.org/10.18653/v1/p17-2089.
[32] S. Zhang and D. Xiong. Sentence weighting for neural machine translation domain adaptation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3181–3190, August 2018. https://aclanthology.org/C18-1269.
[33] C. C. Silva, C. Liu, et al. Extracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Translation, pages 224–231, 2018. https://doi.org/10.18653/v1/w18-6323.
[34] A. Poncelas et al. Data selection with feature decay algorithms using an approximated target side, 2018. https://doi.org/10.48550/arXiv.1811.03039.
[35] A. Imankulova, T. Sato, et al. Improving low-resource neural machine translation with filtered pseudo-parallel corpus. Pages 70–78. Asian Federation of Natural Language Processing, 2017. https://aclanthology.org/W17-5704.
[36] P. Koehn, H. Khayrallah, et al. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, pages 726–739, 2018. https://doi.org/10.18653/v1/w18-6453.
[37] Y. Tang, C. Tran, X. Li, et al. Multilingual translation with extensible multilingual pretraining and finetuning, 2020. https://doi.org/10.48550/arXiv.2008.00401.
[38] J. Cho, E. Jung, et al. Improving bi-encoder document ranking models with two rankers and multi-teacher distillation. In Proceedings of SIGIR '21, pages 2192–2196, 2021. https://doi.org/10.1145/3404835.3463076.
[39] N. Reimers and I. Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation, 2020. https://doi.org/10.48550/arXiv.2004.09813.
[40] K. Papineni, S. Roukos, et al. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318, 2002. https://doi.org/10.3115/1073083.1073135.
[41] A. Fan, S. Bhosale, H. Schwenk, et al. Beyond English-centric multilingual machine translation, 2020. https://doi.org/10.48550/arXiv.2010.11125.
[42] NLLB Team. No language left behind: Scaling human-centered machine translation, 2022. https://doi.org/10.48550/arXiv.2207.04672.

A Appendix 1

To assess the quality of the augmented data, we present exemplary outputs of the two methods described in Section 4 in Tables 6 and 7.

Table 6 shows examples generated by the back-translation method. The Vi-Km sentence pairs in the first and second columns are added to the augmented training dataset if they pass the synthetic data filtering step. The table reveals that the NMT models employed in the back-translation process may still produce semantically incorrect sentences, particularly when translating proper names. Such sentences are subsequently removed during the data filtering process. Notably, no sentence pairs added to the augmented dataset by this method exhibit poor quality.

In Table 7, we present examples of data augmentation via the English pivot language. Due to the relatively high quality of Google Translator, the augmented Vi-Km sentence pairs demonstrate a relatively high quality when the original Km-Vi sentence pair possesses good quality. However, discrepancies arise when the original Vietnamese-Khmer sentence pairs do not maintain complete semantic equivalence, leading to a similar mismatch between the newly generated Khmer sentence and the original Vietnamese one. Consequently, in such instances, the data filtering step excludes the new sentence pair from the augmented dataset.

Table 6: Output examples of the synthetic data generation process.
Example 1 (Action: Keep)
– Vi sentence (Vietnamese monolingual dataset): Việt Nam đã thâm nhập và mở rộng thương mại tại thị trường này. / Vietnam has penetrated and expanded its trade in this market.
– Km sentence generated by the Vi-Km model: េវៀតណាម បាន េ្រជ�ត ច ូ ល ន ិ ង ព្រង ី ក ពាណ ិ ជ្ជកម្ម ក្ន ុ ង ទ ី ផ�រ េនះ ។ / Vietnam has penetrated and expanded trade in this market.
– Vi sentence generated by the Km-Vi model: Việt Nam đã tham gia và mở rộng thương mại tại thị trường này. / Vietnam has joined and expanded trade in this market.

Example 2 (Action: Filter out)
– Vi sentence (Vietnamese monolingual dataset): Đoàn đại biểu kiều bào đã đến dâng hương ở tượng đài Vua Lê. / The overseas Vietnamese delegation came to offer incense at the statue of King Le.
– Km sentence generated by the Vi-Km model: គណៈ្របត ិ ភ ូ ម កព ី ្របេទស េវៀតណាម បាន មក ប ួ ងស ួ ង េន រ ូ បស ំ ណាក េស ្ត ច Li Lei ។ / A delegation from Vietnam came to pray at the statue of King Li Lei.
– Vi sentence generated by the Km-Vi model: Một phái đoàn từ Việt Nam đã đến thăm các khu vực của Hoàng gia Li Lei. / A delegation from Vietnam visited the areas of Royal Li Lei.

Example 3 (Action: Filter out)
– Vi sentence (Vietnamese monolingual dataset): Theo đó, các dụng cụ này dao động mức từ vài chục cho đến hàng chục triệu đồng. / Accordingly, these tools range from a few tens to tens of millions of dong.
– Km sentence generated by the Vi-Km model: តាមរយៈ េនះ ឧបករណ ៍ ទា ំ ងេនះ មាន តៃម្ល ព ី ម ួ យ ដង េទ ម ួ យ ដង េទ ម ួ យ ដង ។ / Through this, these devices are priced from time to time.
– Vi sentence generated by the Km-Vi model: Bằng cách này, những thiết bị này có giá trị một lần, một lần, một lần. / This way, these devices are worth it once, once, once.

Table 7: Output examples of the data augmentation process via the English pivot language.

Example 1 (Action: Keep)
– Original Km sentence: កញ ្ច ប ់ ទ ិ ន្នន ័ យ ្រត�វបាន តេ្រម�ប តាម ត ំ បន ់ ស្រមាប ់ អ្នកទ ិ ញ ងាយ�ស�ល េ្រជ ើ សេរ ើ ស ។ / The data packets are sorted by area for the buyer to easily select.
– Original Vi sentence: Các gói data được chia ra theo khu vực để người mua dễ dàng lựa chọn. / The data packages are divided by region for buyers to easily choose.
– Km augmented sentence: កញ ្ច ប ់ ទ ិ ន្នន ័ យ ្រត�វបាន ចាត ់ ថា ្ន ក ់ តាម ត ំ បន ់ េដ ើ ម្បី ងាយ�ស�ល េ្រជ ើ សេរ ើ ស អ្នកទ ិ ញ ។ / Data packages are categorized by area for easy selection of buyers.

Example 2 (Action: Filter out)
– Original Km sentence: មន ុ ស្ស ្របមាណ ៥០០ លាន នាក ់ អាច ្របឈម ន ឹ ង ភាព្រក ី ្រក េដាយសារ វ ិ បត្ត ិ េសដ្ឋក ិ ច្ច ដ ៏ ធ្ងន ់ ធ្ងរ ប ំ ផ ុ ត តា ំ ងព ី ម ុ ន មក ។ / An estimated 500 million people could face poverty due to the worst economic crisis ever.
– Original Vi sentence: Thế giới đang đối mặt với cuộc suy thoái kinh tế sâu sắc nhất, được đánh giá là nghiêm trọng hơn các cuộc khủng hoảng trước đây. / The world is facing the deepest economic recession, which is considered to be more severe than previous crises.
– Km augmented sentence: ព ិ ភពេលាក ក ំ ព ុ ង ្របឈមម ុ ខ ន ឹ ង វ ិ បត្ត ិ េសដ្ឋក ិ ច្ច ដ ៏ េ្រជ ប ំ ផ ុ ត ែដល ្រត�វបាន េគ ចាត ់ ទ ុ ក ថា ធ្ងន ់ ធ្ងរ ជាង វ ិ បត្ត ិ ម ុ ន ៗ / The world is facing the deepest economic recession, which is considered to be more severe than previous crises.