Adapting an English Corpus and a Question Answering System for Slovene

Uroš ŠMAJDEK, Faculty of Computer and Information Science, University of Ljubljana
Matjaž ZUPANIČ, Faculty of Computer and Information Science, University of Ljubljana
Maj ZIRKELBACH, Faculty of Computer and Information Science, University of Ljubljana
Meta JAZBINŠEK, Faculty of Arts, University of Ljubljana

Slovenščina 2.0, 11(1): 247–274. Original Scientific Article. DOI: https://doi.org/10.4312/slo2.0.2023.1.247-274 (CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/)

Developing effective question answering (QA) models for less-resourced languages like Slovene is challenging due to the lack of proper training data. Modern machine translation tools can address this issue, but this presents another challenge: the given answers must be found in their exact form within the given context, since the model is trained to locate answers, not generate them. To address this challenge, we propose a method that embeds the answers within the context before translation, and evaluate its effectiveness on the SQuAD 2.0 dataset translated using both eTranslation and the Google Cloud translator. The results show that by employing our method we can reduce the rate at which answers were not found in the context from 56% to 7%. We then assess the translated datasets using various transformer-based QA models, examining the differences between the datasets and model configurations. To ensure that our models produce realistic results, we test them on a small subset of the original data that was human-translated. The results indicate that the primary advantages of using machine-translated data lie in refining smaller multilingual and monolingual models. For instance, the multilingual CroSloEngual BERT model fine-tuned and tested on Slovene data achieved nearly equivalent performance to one fine-tuned and tested on English data, with 70.2% and 73.3% of questions answered correctly, respectively. While larger models, such as RemBERT, achieved comparable results, correctly answering questions in 77.9% of cases when fine-tuned and tested on Slovene compared to 81.1% on English, fine-tuning with English and testing with Slovene data also yielded similar performance.

Keywords: question answering, machine translation, multilingual models

1 Introduction

One of the goals of artificial intelligence is to build intelligent systems that can interact with humans and help them. One such task is reading the web and then answering complex questions about any topic with regard to the given context. These question answering systems could have a big impact on the way that we access information. Furthermore, open-domain question answering is a benchmark task in the development of artificial intelligence, since understanding text and being able to answer questions about it is something that we generally associate with intelligence.

Question answering (QA) is one of the disciplines in the broader field of natural language processing (NLP), which involves the automatic answering of questions posed in natural language. Thus, the goal of QA is the development of automated systems that can understand and respond to human questions in a way that is similar to how humans answer questions.
The QA task is typically formulated as follows: given a question and a context, the system must identify the answer to the question within the given context.

Recently, pre-trained contextual embedding (PCE) models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) have attracted plenty of attention due to their good performance in a wide range of NLP tasks, including QA. Compared to earlier information retrieval and knowledge-based systems, modern QA systems are significantly less domain-dependent, as they do not require a specifically tailored database to function effectively. This has led to the development of multilingual question answering systems, where the same system can serve a multitude of languages.

However, multilingual QA tasks typically assume that answers exist in the same language as the question, and require a smaller corpus to fine-tune the system for a given language and a broader domain (e.g. Wikipedia articles). Yet in practice, many languages face both information scarcity, where languages have few reference articles, and information asymmetry, where questions reference concepts from other cultures. Due to the sizes of modern corpora, performing human translations is generally not feasible, and therefore we often employ machine translations instead. However, machine translation has trouble interpreting the nuances of specific languages, such as culturally specific vocabulary (e.g. translating bird sanctuary as "ptičje zatočišče", where the correct translation is "ptičji rezervat"), the use of articles, proper nouns, abbreviations, and implicit relationships between words (Koehn and Knowles, 2017; Arnejšek and Unk, 2020). This is especially problematic in question answering, where the answer has to be found in its exact form within the context to be usable for training such a model.

The objective of our work is thus to reduce the impact of errors in the construction of a machine-translated (MT) dataset that can be used to both fine-tune and test a question answering (QA) model. Specifically, we focus on the translation of the popular SQuAD 2.0 (Rajpurkar et al., 2018) QA dataset. Moreover, we benchmark the accuracy of QA models fine-tuned using the proposed MT dataset by assessing them on a human-translated (HT) subset of the original data. The main contributions of our work are:

• a pipeline for translation of an English QA dataset;
• a performance comparison of various monolingual and multilingual QA models fine-tuned on the original dataset and the English-to-Slovene MT datasets;
• a comparison of the eTranslation and Google Cloud Translation services in terms of raw translation and QA performance using the data translated from English to Slovene; and
• an evaluation of the QA performance of the resulting QA models on the HT subset.

This paper is a follow-up to our submission to JT-DH 2022 (Zupanič et al., 2022). To improve upon the presented concept, we expanded our evaluation to include the corpus translated by the state-of-the-art Google Cloud Translation (Google CT) service to assess the impact of the quality of translations. In addition, to ensure the testing set is not influenced by the machine translation, we replaced the post-edited machine translation samples with a fully human-translated testing set.
Lastly, we also experimented with additional model parameters during evaluation and improved the presentation of our method.

In Section 2 we present the related work. In Section 3 we present our dataset and the process of translation, and evaluate the quality of the translation. In Section 4 we give a brief overview of the models used in the evaluation. In Section 5 we present the evaluation and discuss the results in Section 6. In Section 7 we present the conclusions and give possible extensions and enhancements for future work.

2 Related work

Early question answering systems, such as LUNAR (Woods, 1977), date back to the 1960s and the 1970s. They were characterized by a core database and a set of rules, both handwritten by experts of the chosen domain. Over time, with the development of large online text repositories and increasing computer performance, the focus shifted from such rule-based systems to machine learning and statistical approaches, like Bayesian classifiers and Support Vector Machines.

An example of this kind of system that was able to perform question answering in the Slovene language was presented by Čeh and Ojsteršek (2009), where the authors used classification and answer retrieval in parallel. The system retrieved data from its own database, consisting of MS Excel files, local databases, and integrated web services. For question classification, they used Support Vector Machines. The problem was that the system was very limited in question answering, able to answer only specific predefined classes of questions.

Another major revolution in the field of question answering, and natural language processing in general, was the advent of deep learning approaches and self-attention. One of the most popular approaches of this kind is BERT (Devlin et al., 2019), a transformer model introduced in 2018. Since then it has inspired many other transformer-based models, such as RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2020), XLM (Lample and Conneau, 2019), and XLNet (Yang et al., 2019).

Such models also have the advantage of being able to recognize multiple languages, giving rise to multilingual models and model variants, such as Multilingual BERT, XLM-RoBERTa (Conneau et al., 2020), mT5 (Xue et al., 2021) and RemBERT (Chung et al., 2021). Nevertheless, the training requires large amounts of training data, which many languages lack, leading to varying performance between different languages. They have also been shown to perform worse than monolingual models (Martin et al., 2020; Virtanen et al., 2019). Ulčar and Robnik-Šikonja (2020) thus made an effort to strike a middle ground between the performance of monolingual models and the versatility of multilingual ones by reducing the number of languages in the multilingual model to three – two similar less-resourced languages from the same language family and English. This resulted in two trilingual models, FinEst BERT and CroSloEngual BERT.

In 2020, a Slovene monolingual RoBERTa-based model called SloBERTa (Ulčar and Robnik-Šikonja, 2021) was introduced. It was trained on five different corpora, totalling 3.41 billion words. The latest version of the model is SloBERTa 2.0, which augments the original model by more than doubling the number of training iterations.
The authors evaluated its performance on named-entity recognition, part-of-speech tagging, dependency parsing, sentiment analysis, and word analogy, but not on question answering.

While the described advances in natural language processing models already offer a partial solution for the lack of language-specific training corpora, namely the ability to train the model on a language where large corpora are present (e.g. English), the models still require language-specific fine-tuning, for which a sizable corpus is needed. In our work, we present a potential solution to this problem by using machine-translation methods to translate smaller corpora to Slovene, and then fine-tuning and evaluating the results.

3 Dataset description and methodology

In this section, we describe the dataset used in our study and the methodology employed to create machine- and human-translated datasets for fine-tuning and testing the question answering models.

3.1 Stanford Question Answering Dataset (SQuAD 2.0)

SQuAD 2.0 (Rajpurkar et al., 2018) is a reading comprehension dataset. It is based on a set of Wikipedia articles that cover a variety of topics, from historical, pharmaceutical, and religious texts to texts about the European Union. Every question in the dataset is associated with a segment of text, or span, from the corresponding reading passage. It consists of over 100,000 question-answer pairs extracted from over 500 articles. Compared to SQuAD 1.0, the dataset contains roughly twice as much data and also includes unanswerable questions, which are designed to look similar to answerable ones but lack an answer within the given text. Thus, for a system to perform well on SQuAD 2.0, it must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. An example of a question-answer pair from the dataset is:

• Question: What is the name of the state referred to by historians during the Middle Ages as the Eastern Roman Empire?
• Context: During the Middle Ages, the Eastern Roman Empire survived, though modern historians refer to this state as the Byzantine Empire...
• Answer: the Byzantine Empire

3.2 Machine translation

In this subsection, we describe the proposed machine-translation pipeline, a brief overview of which can be found in Figure 1.

Figure 1: An overview of the machine translation pipeline.

To translate the dataset into Slovene we used two translation web services: eTranslation (European Commission, 2020) and Google CT. Because the web services are primarily designed to translate webpages and short documents in DOCX or PDF format, we designed our translation pipeline as follows:

1. Convert the corpus into HTML format. We wrapped each context-question-answer coupling in separate HTML tags and placed them within a hierarchy resembling that of the original dataset. HTML tag attributes were used to preserve the unique question and answer identifiers for later evaluation. An example of the resulting structure can be seen in Appendix B, and a sketch of the conversion follows this list.
2. Split the HTML file into smaller chunks. We found that 4 MB chunks work best, as larger chunks often failed to translate.
3. Send the chunks to the translation service.
4. Use the original corpus file to compose the translated document in the original format.
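To make steps 1 and 2 concrete, the sketch below shows one way such a conversion and chunking could be implemented, assuming the standard SQuAD JSON layout. It is a minimal illustration: the tag and attribute names are ours, not necessarily the exact ones used in our pipeline, and error handling is omitted.

```python
import json
from html import escape

def squad_to_html(squad_path):
    """Wrap each context, question, and answer in its own HTML tag.
    The id attributes preserve the SQuAD identifiers, so the translated
    document can later be mapped back onto the original corpus."""
    with open(squad_path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    parts = ["<html><body>"]
    for article in data:
        for para in article["paragraphs"]:
            parts.append(f'<div class="context">{escape(para["context"])}</div>')
            for qa in para["qas"]:
                parts.append(f'<p class="question" id="{qa["id"]}">{escape(qa["question"])}</p>')
                for i, ans in enumerate(qa["answers"]):
                    parts.append(f'<p class="answer" id="{qa["id"]}-{i}">{escape(ans["text"])}</p>')
    parts.append("</body></html>")
    return "\n".join(parts)

def chunk_html(html, max_bytes=4_000_000):
    """Split the document on element boundaries so that no chunk sent to
    the translation service exceeds roughly 4 MB."""
    chunks, current, size = [], [], 0
    for line in html.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```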
The requirement for a context-question-answer coupling to be used to train a question answering model is that the answer has to be found in its exact form within the context. For example, if the answer entity is Bizantinsko cesarstvo, but the relevant part of the context was translated as ...v Bizantinskem cesarstvu..., such a question-answer pair is unusable. This is because the model's task is to find the index of an answer to the question within the context. To improve upon basic translation, we employed two different methods.

The first was to correct the answers by breaking down both the answer and the context into lemmas and searching for the lemmatized form of the answer in the lemmatized form of the context. To accomplish this, the CLASSLA (CLARIN Knowledge Centre for South Slavic languages) language processing pipeline (Ljubešić and Dobrovoljc, 2019) was used. If a match was found, we replaced the bad answer with the matching original text from the context (a sketch of this procedure is given at the end of this subsection).

The second method was to embed the answers in the context before translation. This was done by replacing the answer entry of the untranslated document with a copy of the context entry in which the answer was marked by a common HTML tag with a unique attribute, to avoid mistaking it for a preexisting tag within the context. This allows the translation to also take into account the context surrounding the answer, greatly increasing the chance that such an answer will be found in the original context. As the locations of answers within contexts are given by the dataset, finding the correct context entry is a trivial operation. For example, the untranslated answer entry 'the Byzantine Empire' was replaced with the following: 'During the Middle Ages, the Eastern Roman Empire survived, though modern historians refer to this state as the Byzantine Empire...'.
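A minimal sketch of the lemma-correction step, assuming CLASSLA exposes the standard Stanza-style API with character offsets; the helper names are ours:

```python
import classla

# classla.download("sl")  # one-off download of the Slovene models
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma")

def lemma_spans(text):
    """Return (lemma, start_char, end_char) for every token in the text."""
    doc = nlp(text)
    return [(tok.words[0].lemma.lower(), tok.start_char, tok.end_char)
            for sent in doc.sentences for tok in sent.tokens]

def correct_answer(answer, context):
    """If the translated answer is not a literal substring of the context,
    look for its lemma sequence in the lemmatized context and return the
    matching surface text; None means the pair stays unusable."""
    if answer in context:
        return answer
    ans = [lemma for lemma, _, _ in lemma_spans(answer)]
    ctx = lemma_spans(context)
    for i in range(len(ctx) - len(ans) + 1):
        if [lemma for lemma, _, _ in ctx[i:i + len(ans)]] == ans:
            start, end = ctx[i][1], ctx[i + len(ans) - 1][2]
            return context[start:end]  # surface form taken from the context
    return None
```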
3.3 Human translation

When we added the Google CT service, we replaced the post-editing with a completely human translation of the excerpts, so that the end comparison would be more objective and of better quality. The translation was done on a small number of automatically translated excerpts, chosen randomly due to limited human resources.

The provided excerpts included the original paragraphs or contexts, questions, and answers. Firstly, we translated the paragraphs, and then the questions and answers, since the answers had to match the text in the paragraph. As mentioned above in the description of the dataset, the topics of the original texts were very diverse and technical, covering different domains such as religion, history, politics, mathematics, and chemistry.

In total, there were 30 translated contexts with accompanying answerable questions, their answers, and impossible questions. The exact number of different segment types can be seen in Table 1.

Table 1: Statistics for the manually translated data subset

Segment content | Number of instances
Context | 30
Answerable question | 142
Answer | 435
Impossible question | 143
Total | 750

4 Models

In this section, we present each of the five models that were used in the evaluation (Table 2). Since these transformer models are applicable to a variety of natural language tasks, we used Hugging Face's question answering pipeline to run inference with the question answering models (a minimal usage sketch is given at the end of this section).[1] The following models were selected as they are diverse in terms of properties, publicly available, well documented, and have shown promising performance in the past.

[1] https://huggingface.co/docs/transformers/tasks/question_answering

Table 2: Used models with their respective properties

Model Name | Trained languages | No. of training tokens | No. of hidden layers
SloBERTa | 1 | 3.47 billion | 12
CroSloEngual BERT | 3 | 5.9 billion | 12
Multilingual BERT | 104 | No data | 12
XLM-RoBERTa | 100 | 6.3 trillion | 24
RemBERT | 110 | 1.8 trillion | 32

4.1 SloBERTa

SloBERTa (Ulčar & Robnik-Šikonja, 2021) is a Slovene monolingual large pre-trained masked language model. It is closely related to the French CamemBERT model, which is similar to the base RoBERTa with 12 hidden transformer layers but uses a different tokenization model. Since the model requires a large dataset for training, it was trained on five combined datasets with 3.47 billion tokens. It outperformed existing Slovene models.

4.2 CroSloEngual BERT

CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020) is a trilingual model based on the BERT base architecture with 12 hidden transformer layers, trained for Slovene, Croatian, and English using 5.9 billion tokens from these languages. For these languages it performs better than Multilingual BERT, which is expected, since studies show that monolingual models perform better than large multilingual ones (Virtanen et al., 2019).

4.3 Multilingual BERT

Multilingual BERT (M-BERT) (Devlin et al., 2019) is a version of BERT that has been trained on data from Wikipedia in 104 languages to satisfy the demand for multilingualism. It can perform tasks such as question answering, language classification, and many more for a wide range of languages. The model was released in large form with 24 hidden transformer layers and in base form with 12 hidden transformer layers. We used the latter in the current work.

4.4 XLM-RoBERTa

XLM-RoBERTa (XLM-R) (Conneau et al., 2020) is a pre-trained cross-lingual language model developed by Facebook AI. It is trained on 2.5 TB of CommonCrawl data, with a total of 6.3 trillion tokens in 100 languages, and is based on RoBERTa (Robustly Optimized BERT Pretraining Approach) (Liu et al., 2019). XLM-R uses a pretraining objective similar to that of M-BERT. However, XLM-R has a larger model size and a shared vocabulary, and is trained using more training data from the web. XLM-R Large, which we used in our work, has 24 hidden layers.

4.5 RemBERT

RemBERT (Chung et al., 2021) is a multilingual model pre-trained using a masked language modelling (MLM) objective. The model is pre-trained on 1.8 trillion tokens in 110 languages and is similar to M-BERT. However, it differs in that its input and output embeddings are not tied. Instead, it uses small input embeddings and larger output embeddings, which makes the model more efficient because the output embeddings are discarded during fine-tuning. RemBERT, which has 32 hidden transformer layers, is the largest model we tested.
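The snippet below illustrates how such a fine-tuned checkpoint can be queried through the Hugging Face pipeline mentioned above. The model path is a placeholder for any of the fine-tuned checkpoints produced in our experiments, and the Slovene question-context pair is only an example.

```python
from transformers import pipeline

# Placeholder path: any of the models in Table 2, fine-tuned for
# extractive QA on the (translated) SQuAD 2.0 data, could be loaded here.
qa = pipeline("question-answering", model="./sloberta2-squad2-sl")

prediction = qa(
    question="Kako se imenuje država, ki jo sodobni zgodovinarji "
             "imenujejo Bizantinsko cesarstvo?",
    context="V srednjem veku je Vzhodno rimsko cesarstvo preživelo, čeprav "
            "sodobni zgodovinarji to državo imenujejo Bizantinsko cesarstvo.",
    handle_impossible_answer=True,  # SQuAD 2.0: an empty answer is allowed
)
print(prediction)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```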
5 Evaluation and results

In this section, we present the evaluation results of our machine translation methods and the performance of the question answering models fine-tuned on the translated datasets.

5.1 Machine translation

To evaluate the quality of the different translation methods, we measured how many answers can still be found within their respective context in their exact form. The results for the eTranslation service can be seen in Table 3. The resulting numbers of valid questions for both translation services, compared with the original dataset, are presented in Table 4.

Table 3: Results for basic translation, lemma correction (LC), and context embedded (CE) translation of the SQuAD 2.0 dataset by eTranslation

Basic | LC | CE | LC+CE
44% | 66% | 93% | 94%

Note. The percentages represent the number of answers that can be found within their respective context in their exact form.

Table 4: Number of questions in the original SQuAD 2.0 dataset and our machine-translated datasets

Dataset | Subset | AQ | AQ [%] | IQ | Total
Original | Train | 86,821 | 66.6% | 43,498 | 130,319
Original | Test | 5,928 | 49.9% | 5,945 | 11,873
eTranslation | Train | 81,884 | 65.3% | 43,498 | 125,382
eTranslation | Test | 5,735 | 49.1% | 5,945 | 11,680
Google CT | Train | 84,048 | 65.9% | 43,498 | 127,546
Google CT | Test | 5,821 | 49.5% | 5,945 | 11,766

Note. AQ denotes the number of answerable questions and IQ the number of impossible questions.

5.2 Question answering

In order to evaluate the question answering performance of the MT datasets obtained by eTranslation and Google CT, we first used both of them, as well as the original English dataset, to fine-tune the following question answering models: M-BERT, XLM-R, RemBERT, SloBERTa 2.0, and CroSloEngual BERT. This yielded 14 different fine-tuned model configurations, as showcased in the first two columns of Table 6. The fine-tuned models were then evaluated in the two stages described later in this section. All tests were performed on a system with an Intel Xeon E5-2687W v4 @ 3.00 GHz CPU and an RTX 3090 24 GB GPU. Before the evaluation, we removed all punctuation, leading and trailing white space, and articles from both the ground truth and the prediction, and lowercased both. The parameters used for fine-tuning are presented in Table 5. The metrics used for the evaluation matched the official ones for the SQuAD 2.0 evaluation and were as follows (a sketch of both metrics is given at the end of this subsection):

• Exact – The fraction of predictions that matched at least one of the correct answers exactly.
• F1 – The average overlap between prediction and ground truth, defined as an average of F1 scores for individual questions. The F1 score of an individual question is computed as the harmonic mean of precision and recall, where precision is defined as TM/TP and recall as TM/TGP, with TM the number of matching tokens between prediction and ground truth, TP the number of tokens in the prediction, and TGP the number of tokens in the ground truth. A token is defined as a word, separated by white space.

In the first step of the evaluation, each of the models was tested using the HT subset of the original English testing set and its untranslated counterpart. Additionally, we also repeated each of the tests on the MT testing subsets, in order to assess their viability as testing sets. The F1 scores can be seen in Table 6, whereas a full result overview is presented in Appendix C, Table 10.

In the second step of the evaluation, we repeated the tests with the full original and MT testing sets, to account for potential discrepancies between the difficulty of the original dataset and that of the subset used in the first set of experiments. The results of this step can be seen in Table 7 and Appendix C, Table 11.
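A minimal sketch of the two metrics, following the definitions and the normalization described above (lowercasing and removal of punctuation, articles, and extra white space); this mirrors the official SQuAD 2.0 evaluation but is our own simplified rendering:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and (English) articles, squeeze spaces."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact(prediction, gold_answers):
    """1 if the prediction matches at least one correct answer exactly."""
    return max(int(normalize(prediction) == normalize(g)) for g in gold_answers)

def f1(prediction, gold_answers):
    """Token-level F1 against the best-matching gold answer."""
    def single(pred, gold):
        p, g = normalize(pred).split(), normalize(gold).split()
        if not p or not g:
            return float(p == g)  # unanswerable: both empty counts as a match
        tm = sum((Counter(p) & Counter(g)).values())  # TM: matching tokens
        if tm == 0:
            return 0.0
        precision, recall = tm / len(p), tm / len(g)  # TM/TP and TM/TGP
        return 2 * precision * recall / (precision + recall)
    return max(single(prediction, g) for g in gold_answers)
```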
By comparing this set of results with the ones obtained in the first step, we can see that the full set gives slightly better results, implying that the questions chosen for the subset were more difficult on average.

Table 5: Parameters used to fine-tune the evaluated models

Model Name | B | MS | LR | E
XLM-R Large | 4 | 265 | 1e-5 | 3
M-BERT | 8 | 256 | 1e-5 | 3
CroSloEngual BERT | 8 | 320 | 1e-5 | 3
RemBERT | 4 | 256 | 1e-5 | 3
SloBERTa 2.0 | 8 | 320 | 1e-5 | 3

Note. B denotes the batch size used during fine-tuning, MS the maximum sequence length, LR the learning rate, and E the number of epochs.

6 Discussion

6.1 Quantitative analysis

By comparing the results of matching entries in Tables 6 and 7, we can observe that the results are consistently better when using the entire dataset instead of the randomly chosen subset for testing. The differences are relatively minor, though, and both tables still show the same trends, which we interpret as a positive indicator that the results in Table 6 are a good representation of the behaviour of the entire dataset.

6.1.1 Manual versus machine translation

By comparing the results of tests using HT data of a model fine-tuned on MT data with those of the same model fine-tuned on the original data (Table 6), we can see that the impact of fine-tuning with MT datasets varies depending on the size and the inherent performance of the model. The largest performance gain from fine-tuning a multilingual model on MT data as opposed to original data can be observed with M-BERT (F1 score of 74.0% as opposed to 58.2%) and CroSloEngual BERT (F1 score of 73.6% as opposed to 65.1%). However, for the latter, this is only true when using the set translated by Google CT. The impact is less noticeable for the larger RemBERT model (F1 score of 81.5% as opposed to 79.5%), while for the XLM-R Large model it is slightly negative (F1 score of 81.4% as opposed to 81.6%). We attribute this to the inherent ability of those two models to perform well when trained and tested on two different languages. This is clearly visible in our case, as these two models, unlike the smaller ones, retained much of their ability to answer English questions even when fine-tuned on MT data.

It is harder to evaluate the impact of using the MT fine-tuning set on SloBERTa 2.0, since we cannot benchmark it against the same model fine-tuned on the original data, but comparing its results to other similarly sized models – M-BERT and CroSloEngual BERT – we can see that it outperforms both of them, which we consider a positive indicator of the viability of using MT data to fine-tune the model.

Taking all of this into account, we conclude that using MT data to fine-tune question answering models is a superior option to using the original English data, especially if one is constrained to using small models. Additionally, it has the benefit of enabling the use of Slovene monolingual models, which tend to outperform multilingual models of the same size.

6.1.2 Comparison of translation services

Comparing the results of the models fine-tuned with data translated by eTranslation and Google CT and tested on HT data in Table 6, we can observe that in most cases Google CT outperforms eTranslation. The exception to this is M-BERT, where eTranslation slightly outperformed its counterpart, but the difference is not significant.
The magnitude of the differences varies from almost insignificant with M-BERT and XLM-R Large to noticeable with CroSloEngual BERT and RemBERT. We suspect this is due to the inherent differences in the model structures and the data they were pre-trained on.

By observing the results of the different models and their variations when tested on the data translated by MT, as opposed to the results obtained by testing on HT data (Table 6), we can see that the results vary significantly. We suspect that this is due to the various grammatical errors present in the MT data, as shown in Section 6.2, which are not present in the data that was used to pre-train the models; the models thus have a harder time recognizing the structure of the context. This is further reinforced by the fact that the models consistently yield better results when tested using data translated by Google CT as opposed to eTranslation (Tables 6 and 7), since the former contains fewer grammatical and semantic mistakes (Section 6.2). Additionally, we can also observe a bias where models fine-tuned with one translation service perform better when tested on data from the same translation service than a model fine-tuned with the other translation service and tested on that same testing set. All this points us toward concluding that while the MT datasets are a viable solution for fine-tuning, they are less suitable for testing, especially if the resulting translation is of lower quality, as when using the eTranslation service.

Table 6: Comparison of the F1 scores of various models and their fine-tuning configurations on the human-translated subset of SQuAD 2.0 (N=285), and the subsets containing the same questions from the original English dataset and the two machine-translated datasets

Model Name | Fine-Tuning Dataset | Original [%] | eTranslation [%] | Google CT [%] | Human Transl. [%]
M-BERT | Original | 78.4 | 48.2 | 55.9 | 58.2
M-BERT | eTranslation | 62.6 | 64.5 | 73.4 | 74.0
M-BERT | Google CT | 65.1 | 64.8 | 71.4 | 73.6
CroSloEngual BERT | Original | 75.5 | 60.8 | 63.4 | 65.1
CroSloEngual BERT | eTranslation | 63.1 | 58.8 | 64.5 | 63.6
CroSloEngual BERT | Google CT | 58.8 | 66.5 | 66.6 | 73.6
SloBERTa 2.0 | eTranslation | 65.0 | 72.2 | 76.1 | 74.9
SloBERTa 2.0 | Google CT | 65.2 | 68.0 | 72.9 | 78.3
XLM-R Large | Original | 85.5 | 69.1 | 75.8 | 81.6
XLM-R Large | eTranslation | 82.6 | 73.1 | 76.9 | 81.1
XLM-R Large | Google CT | 82.3 | 70.9 | 77.4 | 81.4
RemBERT | Original | 87.2 | 71.4 | 74.3 | 79.5
RemBERT | eTranslation | 84.1 | 72.9 | 76.6 | 78.6
RemBERT | Google CT | 84.8 | 71.6 | 76.0 | 81.5

Note. Specific parameters used in fine-tuning are presented in Table 5.

Table 7: Comparison of F1 scores of various models and their fine-tuning configurations on the English SQuAD 2.0 evaluation dataset and the two Slovene machine-translated SQuAD 2.0 evaluation datasets (N=11,680)

Model Name | Fine-Tuning Dataset | Original [%] | eTranslation [%] | Google CT [%]
M-BERT | Original | 78.9 | 59.2 | 61.9
M-BERT | eTranslation | 68.2 | 68.3 | 70.7
M-BERT | Google CT | 68.9 | 67.9 | 71.3
CroSloEngual BERT | Original | 76.3 | 63.5 | 66.8
CroSloEngual BERT | eTranslation | 68.2 | 65.5 | 68.3
CroSloEngual BERT | Google CT | 65.7 | 66.5 | 70.0
SloBERTa 2.0 | eTranslation | 64.7 | 73.7 | 76.8
SloBERTa 2.0 | Google CT | 66.9 | 72.8 | 77.0
XLM-R Large | Original | 86.3 | 74.8 | 78.5
XLM-R Large | eTranslation | 83.0 | 75.6 | 78.3
XLM-R Large | Google CT | 84.4 | 75.5 | 80.1
RemBERT | Original | 87.5 | 71.4 | 74.3
RemBERT | eTranslation | 83.9 | 72.9 | 76.6
RemBERT | Google CT | 84.5 | 71.6 | 76.0

Note. The English dataset only contains the questions present in its Slovene counterpart. Specific parameters used in fine-tuning are presented in Table 5.
6.2 Qualitative analysis of translations

A comparison of the differences and the types of mistakes in the two machine translations and the human translation was made. The representativeness of these differences cannot be determined, but by looking at more examples some general mistakes of machine translation can be noted. It should also be considered that some of the mistakes of machine translation are more severe than others, and that some segments contain a much greater number of them than others, so the mistakes could not be counted exactly.

Firstly, the segments with contexts are very long, and this normally led to more grammatical, syntactic, and stylistic mistakes in the machine translations. The eTranslation MT yielded the worst results, as can be seen in Appendix A, example 1, where there was a wrong gender agreement and a major semantic mistake ('caving in' translated as 'jamarstvo'), which did not occur with Google CT. This was expected to a certain degree, as Google CT uses state-of-the-art translation methods, while eTranslation does not. Additionally, eTranslation is designed to perform best when working with texts on EU-related matters (European Commission, 2020), while our dataset comprises technical texts covering a wide variety of topics.

There was also a great dissimilarity between the translations of answerable and impossible questions, because machine translation provided incoherent results. The changes are more notable because they affect the overall understanding of potential readers. These segments are shorter, but in both MT examples the word 'plants' was translated literally, so we can see in the example in Appendix A that the HT translation is still the best one. Nevertheless, there was a larger number of instances where Google CT performed better than eTranslation at the grammatical and syntactical levels.

The segments with answers were the most similar ones, most probably because they are shorter. The contextual mistakes in the answers were for the most part already corrected in the contexts. More severe mistakes include semantic mistakes (e.g. plants translated as 'rastline' instead of 'naprave') and completely wrong answers (e.g. an empty segment instead of 'Fermilab', or 'in' instead of '1,388'). Some frequent mistakes also occurred in translations of the names of movements, books, projects, or other names (e.g. 'Bricks for Varšava' was left untranslated by the eTranslation MT, while Google CT translated it as 'Opeke za Varšavo' and it was changed in HT to 'Zidaki za Varšavo'). There were some punctuation errors, but the most interesting are the grammatical mistakes of both MT services, especially when the wrong grammatical case, gender, or number is used. The answers had to be in the exact same form, so many answers do not sound coherent, which is of course not the case for English, where conjugation does not change the words as much (e.g. with eTranslation 'Which part of China had people ranked higher in the class system?' — 'Northern' — 'V katerem delu Kitajske so bili ljudje višje v razrednem sistemu?' — 'Severni'). On the other hand, some corrected segments were identical even though the source was different, due to the use of articles in the English language (e.g. 'North Sea' and 'the North Sea' were both translated as 'Severno morje').
This occurred with both MT services, but in some cases Google CT performed better, producing more exact matches. It was also better at capturing the same amount or length of answers as in the original. The answer of eTranslation for 'harvests of their Chinese tenants' was 'čemer je dohodek od žetve kitajskih najemnikov', whereas Google CT captured only 'žetve njihovih kitajskih najemnikov'.

It should also be noted that the SQuAD 2.0 database is not entirely reliable. In the batch of 142 randomly sampled test question and answer groups, there were 14 occurrences where at least one of the given answers was not correct (e.g. 'Advanced Steam movement' instead of 'pollution' as an answer to 'Along with fuel sources, what concern has contributed to the development of the Advanced Steam movement?').

6.3 Qualitative analysis of predictions

By observing the individual cases of incorrect predictions, the main source of errors seems to stem from the grammatical and stylistic errors of the machine translation and occasionally its inability to convey the right meaning. The correct predictions are most likely the ones where the answer to the question is short and the words are not conjugated, i.e. numbers and names, even though there are some exceptions.

In the examples provided, we can see that there are two types of errors that we looked at. The first is when there is a wrong answer but a right prediction (Table 8), and the second is a correct answer and a wrong prediction (Table 9). Most of the time, the wrong answers and predictions occur with the eTranslation service, and the improvement of Google CT and HT is visible from a few representative examples. Sometimes, when the questions are more complicated, even Google CT and HT do not provide a prediction at all, while sometimes only HT provides the correct prediction.

Table 8: Examples of correct predictions with wrong answers

# | Dataset | Question | Answer | Prediction
1 | ENG | How many of Warsaw's inhabitants spoke Polish in 1933? | 833,500 | 833,500
1 | ET | Koliko prebivalcev Varšave je leta 1933 govorilo poljsko? | prebivalcev | 833.500
1 | GCT | Koliko prebivalcev Varšave je leta 1933 govorilo poljsko? | 833.500 | 833.500
1 | HT | Koliko prebivalcev Varšave je leta 1933 govorilo poljski jezik? | 833.500 | 833.500
2 | ENG | Who recorded "Walking in Fresno?" | Bob Gallion | Bob Gallion je
2 | ET | Kdo je posnel "Walking in Fresno?" | Bob | Bob Gallion
2 | GCT | Kdo je posnel »Walking in Fresno«? | Bob Gallion | Bob Gallion
2 | HT | Kdo je posnel »Walking in Fresno«? | Bob Gallion | Bob Gallion

Note. ENG denotes the English dataset, ET the one translated by the eTranslation service, GCT the one translated by the Google Cloud Translation service, and HT the one translated by a human.

Table 9: Examples of correct answers with wrong predictions

# | Dataset | Question | Answer | Prediction
1 | ENG | How many total acres is Woodward Park? | 300 acres | 300 acres
1 | ET | Koliko hektarjev je Woodward Park? | 300 hektarjev | 235 hektarjev
1 | GCT | Koliko skupno hektarjev obsega Woodward Park? | 300 hektarjev | 300
1 | HT | Koliko akrov skupaj obsega park Woodward? | 300 akrov | 300
2 | ENG | How many miles, once completed, will the Lewis S. Eaton trail cover? | 22 miles | 22
2 | ET | Koliko kilometrov, ko bo končano, bo pokrivalo Lewis S. Eaton? | 22 milj | (35 km
2 | GCT | Koliko milj bo pot Lewisa S. Eatona pokrivala, ko bo končana? | 22 milj | 22
2 | HT | Koliko milj bo, ko bo dokončana, dolga pot Lewisa S. Eatona? | 22 milj | 22

Note. ENG denotes the English dataset, ET the one translated by the eTranslation service, GCT the one translated by the Google Cloud Translation service, and HT the one translated by a human.

7 Conclusion

In this work, we presented a method for the machine translation of question answering datasets. To evaluate the method, we applied it to the SQuAD 2.0 dataset and used the results to train and test the following question answering models: Multilingual BERT, CroSloEngual BERT, SloBERTa 2.0, XLM-RoBERTa, and RemBERT. In order to discern the impact of the quality of the translated data, we performed the translation with two different translation services: eTranslation and the state-of-the-art Google Cloud Translation. To evaluate the models using data as close to real as possible, we took a small subset of the original testing set and manually translated it to Slovene, which formed the basis for the performance comparisons.

The results show that using machine-translated data for evaluation led to notably worse results as compared to the human-translated data. Moreover, we noticed that while multilingual models fine-tuned using machine-translated data performed better than ones fine-tuned on English data when given the task of answering machine-translated questions, the situation was in most cases reversed when given the task of answering human-translated questions. This leads us to conclude that machine translation, at least as available via the eTranslation service, is not particularly suitable for training multilingual models. Of all the models, SloBERTa 2.0 produced the best results on both machine- and human-translated data, while RemBERT gave comparable results even when only fine-tuned on the English dataset.

The results show that the application of machine-translated data produced by our method leads to better results with smaller multilingual question answering models, as compared to fine-tuning them using the original English data. On the other hand, the results for larger models were mostly unaffected by the language of the dataset used to train them. The most notable benefit is the ability to fine-tune monolingual models, which would otherwise be unusable. Our experiments also show that this machine-translated data is not suitable for the purpose of testing the models. The impact of the quality of the translation service is minor and varies depending on the model.

The testing procedure could be improved by using a dataset that was already manually translated to Slovene, which would allow us to benchmark our results against it as well. The experiment could also be expanded by including more datasets, such as Natural Questions (Kwiatkowski et al., 2019), and other models, such as Microsoft's mDeBERTaV3. Additionally, further effort could be dedicated to ascertaining the optimal parameters for fine-tuning the question answering models.

References

Arnejšek, M., & Unk, A. (2020). Multidimensional assessment of the eTranslation output for English–Slovene. Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (pp. 383–392). Lisboa: European Association for Machine Translation. Retrieved from https://aclanthology.org/2020.eamt-1.41

Chung, H. W., Févry, T., Tsai, H., Johnson, M., & Ruder, S. (2021). Rethinking Embedding Coupling in Pre-trained Language Models. International Conference on Learning Representations.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ..., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440–8451). Online: Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747

Čeh, I., & Ojsteršek, M. (2009). Developing a question answering system for the Slovene language. WSEAS Transactions on Information Science and Applications.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 4171–4186). Minneapolis: Association for Computational Linguistics. doi: 10.18653/v1/N19-1423

European Commission. (2020). CEF Digital eTranslation.

Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation (pp. 28–39). Vancouver: Association for Computational Linguistics. doi: 10.18653/v1/W17-3204

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., ..., & Petrov, S. (2019). Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics.

Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. Advances in Neural Information Processing Systems (NeurIPS).

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=H1eA7AEtvS

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ..., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Ljubešić, N., & Dobrovoljc, K. (2019). What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing (pp. 29–34). Florence: Association for Computational Linguistics. doi: 10.18653/v1/W19-3704

Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., ..., & Sagot, B. (2020). CamemBERT: a Tasty French Language Model. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7203–7219). Online: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.645

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ..., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21, 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html

Rajpurkar, P., Jia, R., & Liang, P. (2018). Know What You Don't Know: Unanswerable Questions for SQuAD. doi: 10.48550/ARXIV.1806.03822

Ulčar, M., & Robnik-Šikonja, M. (2020). FinEst BERT and CroSloEngual BERT. International Conference on Text, Speech, and Dialogue (pp. 104–111).
Ulčar, M., & Robnik-Šikonja, M. (2021). SloBERTa: Slovene monolingual large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD.

Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., ..., & Pyysalo, S. (2019). Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.

Woods, W. A. (1977). Lunar rocks in natural English: Explorations in natural language question answering.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., ..., & Raffel, C. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 483–498). Online: Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.

Zupanič, M., Zirkelbach, M., Šmajdek, U., & Jazbinšek, M. (2022). Preparing a corpus and a question answering system for Slovene. In D. Fišer & T. Erjavec (Eds.), Jezikovne tehnologije in digitalna humanistika: zbornik konference (pp. 353–359). Ljubljana: Inštitut za novejšo zgodovino. Retrieved from https://nl.ijs.si/jtdh22/pdf/JTDH2022_Proceedings.pdf

Appendix A: Translation examples

Below are some examples of the two machine translations and the human translation, showing some specific differences which occurred multiple times.

1. Example of context segment (excerpt)
• Original: The Northern Chinese were ranked higher and Southern Chinese were ranked lower because southern China withstood and fought to the last before caving in.
• eTranslation: Severna Kitajci so bili uvrščeni višje in južna Kitajci so bili uvrščeni nižje, ker je južna Kitajska zdržala in se borila do zadnjega pred jamarstvom.
• Google CT: Severni Kitajci so bili uvrščeni višje, južni Kitajci pa nižje, ker je južna Kitajska zdržala in se borila do zadnjega, preden je popustila.
• Human translation: Severni Kitajci so bili uvrščeni višje in južni Kitajci so bili uvrščeni nižje, ker se je južna Kitajska pred predajo upirala in se borila do zadnjega.

2. Examples of answerable question segment
• Original: Who was Al-Banna's assassination a retaliation for the prior assassination of? / What plants create most electric power?
• eTranslation: Kdo je bil Al-Bannin umor maščevanja zaradi predhodnega umora? / Katere rastline ustvarjajo največ električne energije?
• Google CT: Komu je bil atentat Al-Banne povračilo za prejšnji atentat? / Katere rastline proizvajajo največ električne energije?
• Human translation: Al-Bannov umor je bil maščevanje za čigav predhodni umor? / Kateri obrati ustvarjajo največ električne energije?

3. Example of impossible question segment
• Original: What Book of the Bible is knowledge of the law traced back to?
• eTranslation: Do katere knjige Svetega pisma je znano pravo?
• Google CT: Od katere svetopisemske knjige sega znanje o zakonu?
• Human translation: V kateri svetopisemski knjigi že zasledimo poznavanje prava?

Appendix B: HTML Structure

[The HTML markup of the original example is not reproducible here; the recoverable content of one translated unit is outlined below.]

Context: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.

Question: In what country is Normandy located?
Answer: France

Question: When were the Normans in Normandy?
Answers: "10th and 11th centuries" and "in the 10th and 11th centuries", each represented by a full copy of the context in which the answer span is marked by an HTML tag with a unique attribute.

Appendix C: Detailed results

Table 10: Full comparison of the results of various models and their fine-tuning configurations on the human-translated subset of SQuAD 2.0 (N=285), and the subsets containing the same questions from the original English dataset and the two machine-translated datasets

Model Name | Fine-Tuning Dataset | Original Exact / F1 | eTranslation Exact / F1 | Google CT Exact / F1 | Human Transl. Exact / F1
M-BERT | Original | 76.1 / 78.4 | 42.8 / 48.2 | 53.0 / 55.9 | 55.1 / 58.2
M-BERT | eTranslation | 58.2 / 62.6 | 58.6 / 64.5 | 68.7 / 73.4 | 70.9 / 74.0
M-BERT | Google CT | 59.3 / 65.1 | 58.9 / 64.8 | 66.7 / 71.4 | 69.4 / 73.6
CroSloEngual BERT | Original | 73.3 / 75.5 | 53.0 / 60.8 | 57.5 / 63.4 | 61.1 / 65.1
CroSloEngual BERT | eTranslation | 59.6 / 63.1 | 51.6 / 58.8 | 58.2 / 64.5 | 60.0 / 63.6
CroSloEngual BERT | Google CT | 55.8 / 58.8 | 60.7 / 66.5 | 61.8 / 66.6 | 70.2 / 73.6
SloBERTa 2.0 | eTranslation | 59.3 / 65.0 | 64.9 / 72.2 | 69.5 / 76.1 | 70.9 / 74.9
SloBERTa 2.0 | Google CT | 61.8 / 65.2 | 61.4 / 68.0 | 66.3 / 72.9 | 73.0 / 78.3
XLM-R Large | Original | 83.5 / 85.5 | 61.4 / 69.1 | 70.9 / 75.8 | 78.6 / 81.6
XLM-R Large | eTranslation | 79.3 / 82.6 | 66.0 / 73.1 | 71.2 / 76.9 | 76.1 / 81.1
XLM-R Large | Google CT | 79.3 / 82.3 | 63.5 / 70.9 | 71.2 / 77.4 | 76.8 / 81.4
RemBERT | Original | 84.9 / 87.2 | 64.2 / 71.4 | 69.1 / 74.3 | 74.0 / 79.5
RemBERT | eTranslation | 80.0 / 84.1 | 64.9 / 72.9 | 70.2 / 76.6 | 71.9 / 78.6
RemBERT | Google CT | 80.4 / 84.8 | 63.2 / 71.6 | 70.2 / 76.0 | 77.9 / 81.5

Note. Specific parameters used in fine-tuning are presented in Table 5.
Table 11: Full comparison of the results of various models and their fine-tuning configurations on the English SQuAD 2.0 evaluation dataset and the two Slovene machine-translated SQuAD 2.0 evaluation datasets (N=11,680)

Model Name | Fine-Tuning Dataset | Original Exact / F1 | eTranslation Exact / F1 | Google CT Exact / F1
M-BERT | Original | 75.8 / 78.9 | 52.7 / 59.2 | 57.1 / 61.9
M-BERT | eTranslation | 63.7 / 68.2 | 61.7 / 68.3 | 65.7 / 70.7
M-BERT | Google CT | 64.4 / 68.9 | 61.4 / 67.9 | 66.9 / 71.3
CroSloEngual BERT | Original | 72.9 / 76.3 | 56.2 / 63.5 | 61.5 / 66.8
CroSloEngual BERT | eTranslation | 63.6 / 68.2 | 58.3 / 65.5 | 62.3 / 68.3
CroSloEngual BERT | Google CT | 61.4 / 65.7 | 59.8 / 66.5 | 65.3 / 70.0
SloBERTa 2.0 | eTranslation | 60.6 / 64.7 | 66.6 / 73.7 | 71.4 / 76.8
SloBERTa 2.0 | Google CT | 63.9 / 66.9 | 65.6 / 72.8 | 72.3 / 77.0
XLM-R Large | Original | 83.4 / 86.3 | 67.1 / 74.8 | 73.4 / 78.5
XLM-R Large | eTranslation | 79.0 / 83.0 | 68.0 / 75.6 | 72.3 / 78.3
XLM-R Large | Google CT | 80.9 / 84.4 | 68.0 / 75.5 | 75.3 / 80.1
RemBERT | Original | 84.5 / 87.5 | 67.1 / 71.4 | 69.1 / 74.3
RemBERT | eTranslation | 79.1 / 83.9 | 66.8 / 72.9 | 70.2 / 76.6
RemBERT | Google CT | 80.1 / 84.5 | 67.0 / 71.6 | 70.2 / 76.0

Note. The English dataset only contains the questions present in its Slovene counterpart. Specific parameters used in fine-tuning are presented in Table 5.

Adapting an English Corpus and a Question Answering System for the Slovene Language

The lack of suitable training data is one of the key problems in the development of Slovene question answering (QA) models. Modern machine translation tools can solve this problem, but their use presents a new challenge: the answers must exactly match the parts of the given context that contain them, since the model does not generate answers, but only locates them. As a solution, we propose a method in which the answers are translated together with the context, which increases the probability that the answer will be translated in the same form. We evaluate the effectiveness of this method on the SQuAD 2.0 dataset, translated using the eTranslation and Google Cloud services, where its use reduces the share of mismatches between answer and context from 56% to 7%. We then assess the translated data using various transformer-based QA models and examine the differences between the datasets and model configurations. To ensure results that are as realistic as possible, we test the models on human translations of a small share of the original dataset. The results show that the main advantages of using machine-translated data appear in the fine-tuning of smaller multilingual models and monolingual models. The multilingual CroSloEngual BERT model, for example, achieved 70.2% exact matches when tested on Slovene data, compared to 73.3% exact matches when tested on English data. While the results for larger models were similar, with RemBERT achieving 77.9% exact matches on Slovene data compared to 81.1% on English data, these models performed similarly even when fine-tuned on English data, meaning that machine-translated data did not substantially improve them.

Keywords: question answering systems, machine translation, multilingual models