https://doi.org/10.31449/inf.v46i7.4236 Informatica 46 (2022) 103–118

Automatic Question Generation using RNN-based and Pre-trained Transformer-based Models in Low Resource Indonesian Language

Karissa Vincentio and Derwin Suhartono
Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, 11530, Indonesia
E-mail: felicia.vincentio@binus.ac.id, dsuhartono@binus.edu

Keywords: natural language processing, natural language generation, automatic question generation, recurrent neural network, long short-term memory, gated recurrent unit, transformer, fine-tuning

Received: June 14, 2022

Although Indonesian is the fourth most frequently used language on the internet, the development of NLP in Indonesian has not been studied intensively. One form of NLP application classified as an NLG task is the Automatic Question Generation task. Generally, the task has performed well using rule-based and cloze-test approaches, but these depend heavily on the defined rules. While such an approach is suitable for small-scale automated question generation systems, it becomes less efficient as the system grows. Many NLG model architectures have recently proven to deliver significantly better performance than previous architectures, such as generative pre-trained transformers, text-to-text transfer transformers, bidirectional auto-regressive transformers, and many more. Previous studies on AQG in Indonesian were built on RNN-based architectures such as GRU and LSTM, and on the Transformer. The performance of the models from those studies is compared with state-of-the-art models: the multilingual models mBART and mT5, and the monolingual models IndoBART and IndoGPT. As a result, the fine-tuned IndoBART performed significantly better than either BiGRU or BiLSTM on the SQuAD dataset.
On most metrics, the fine-tuned IndoBART also performed better on the TyDiQA dataset, which has a smaller population than the SQuAD dataset.

Povzetek: For Indonesian, the fourth most common language on the web, a text-to-question language converter was developed for teaching purposes.

1 Introduction

The current education system requires a process to efficiently evaluate students' understanding of lessons from reading a text's content [1]. The preparation of questions by students can consume much time, while obtaining questions from external sources, such as question collections, makes it possible that they are irrelevant to the content being studied [2]. In addition, questions designed to evaluate students' understanding of textual reading are also influenced by their effectiveness, as seen in the development of various question-preparation strategies [3]. From these problems, many techniques have been investigated for the content-based Question Generation process, generally known as the Automatic Question Generation (AQG) system, an NLP application in the NLG branch, using approaches ranging from rule-based to attention-based models [4].

AQG is the task of automatically generating questions from various inputs such as raw text, databases, or semantic representations [5]. From this definition, the input can take various forms, such as sentences, paragraphs, and poetry [6]. AQG has various applications, including healthcare systems, automated help systems, chatbot systems, and others [7]. In this paper, AQG is studied by processing text or related information with a sequence-to-sequence approach, namely the Bidirectional Gated Recurrent Unit (BiGRU), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures, as well as with the pre-trained fine-tuning approach of the mBART and mT5 model architectures.
1.1 AQG in English

Cohen first proposed AQG in 1929 to represent a question's content as a formula with one or more independent variables [8]. Since then, researchers have become interested in developing AQG for educational purposes, mainly because asking questions during teaching encourages students to understand what they are learning. One such AQG, proposed by Wolfe in 1976, supported learning through independent study [9].

Recent AQG work showed that leveraging linguistic representations such as Part-Of-Speech (POS) tags and Named Entity Recognition (NER) through deep neural networks based on Bidirectional Encoder Representations from Transformers (BERT) can achieve state-of-the-art results. The model architecture consists of a two-layer bidirectional Long Short-Term Memory (LSTM) encoder and a two-layer unidirectional LSTM decoder. The bidirectional LSTM encoder produces sequences of hidden states, and the unidirectional LSTM decoder then uses this representation to generate words [10].

Another recent work fine-tuned a miniature version of the T5 transformer language model, consisting of 220 million parameters, on the SQuAD v1.1 dataset, which contains 100,000 question-answer pairs. To generate questions, the model was trained on the passage with a 30% probability of the answer being replaced by the [MASK] token [11]. Several open English question-answer pair datasets can be leveraged for transfer learning approaches when creating AQG systems [12, 13, 14, 15, 16].

1.2 AQG in Indonesian

Various studies on NLP-based AQG have been conducted [17], but few of them address Indonesian.
One study [18] conducted in Indonesian built language models that use a sequence-to-sequence approach, trained on the SQuAD v2 [19] and TyDiQA [20] datasets, which were translated into Indonesian using the Google Translate API v2, with the Transformer architecture as well as Recurrent Neural Network (RNN) architectures such as BiLSTM and BiGRU [21]. This study found that the questions generated using BiLSTM and BiGRU were not significantly different. Meanwhile, the Transformer had difficulties understanding the semantic context of the provided information [22].

Figure 1: Related Research Modeling Process Diagram [18]

1.3 Models Benchmark in NLG Tasks

Benchmark resources [23, 24] had not been involved in the results of the model studied for the Question Generation task in Indonesian [18]. Indonesian benchmark resources such as IndoNLU [25] and IndoNLG [4] can play a significant role in comparisons and literature reviews by other researchers, serving as references for developing Automatic Question Generation systems that are more reliable and faithful to the provided textual or reading content.

In addition, the IndoNLU benchmark only covers NLU tasks in Indonesian, such as sentiment analysis [26, 27, 28, 29, 30], similar to the GLUE [31] benchmark for English Natural Language Understanding (NLU) tasks. Since the Question Generation task is an NLG task, the GEM benchmark, which covers various NLG tasks including Question Generation, should also be applied [32]. The GEM benchmark has selected and processed the most common datasets for the available NLG tasks. GEM also conducted baseline modeling using monolingual language models such as BART and T5, and the multilingual models mBART and mT5. GEM additionally provides a testbed for automated evaluation, including metrics appropriate to each task.
The GEM benchmark, which is regularly updated, makes it easier for researchers in other fields of NLG to compare the models they build with previously developed models [32].

Recently, many NLG model architectures have proven to deliver significantly better performance than previous architectures, such as generative pre-trained transformers, text-to-text transfer transformers, bidirectional auto-regressive transformers, and many more. Meanwhile, the previous study of AQG in Indonesian only examined the GRU, LSTM, and Transformer methods [18]. In this study, to develop research on AQG tasks in Indonesian, the performance of the models from previous studies is compared with state-of-the-art models such as mBART and mT5.

Although Indonesian is the fourth most frequently used language on the internet, the development of NLP in Indonesian has not been studied intensively [25]. The automatic question generation process is one form of NLP application classified as an NLG task [18]. Generally, the task has performed well using approaches such as rule-based methods and cloze tests, but these approaches depend heavily on the set of rules that have been created. While this approach is suitable for small-scale automated question generation systems, it becomes less efficient as the system grows [33]. In this context, deep learning approaches, especially in NLP, generalize better than rule-based approaches [34]. Although a deep learning approach is relatively complex, the system can construct its own rules and evolve coherently to adapt to its dataset if adequately trained and properly configured [35].
In the previous related research [18], the Stanford question-answering dataset (SQuAD v2 [36]), which consists of 536 articles with 161,550 question-answer pairs in English, was translated and preprocessed into Indonesian; some translations were then improved by using fuzzy string matching to find inconsistent translations, after which the data could be used for model training and evaluation [18]. The study trained language models based on RNN architectures, namely GRU and LSTM in a bidirectional configuration, and a language model based on the transformer architecture, all with a sequence-to-sequence learning approach [18].

Several adaptations were made to the RNN-based (BiGRU and BiLSTM) and from-scratch transformer language models, such as the use of several linguistic features (Ans, Case, POS, Named Entity (NE)) and the presence of a sentence embedding encoder. This research aims to measure how well language models based on the RNN architecture and state-of-the-art transformer-based models perform the question generation task in Indonesian [18]. A validation process was followed by testing the models on the SQuAD test set, to see how they perform on data with the same characteristics, and then by evaluation on the TyDiQA dataset, which is natively Indonesian [20]. The overall flow can be seen in Figure 1.

2 Deep Learning Methods

2.1 RNN Based Models

RNN is a widely used neural network architecture for NLP, which has proven relatively accurate and efficient for developing language models as well as for speech recognition tasks. Essentially, an RNN uses feedback loops that allow the input sequence to be shared across different nodes and give the RNN an internal memory, which helps it generate predictions based on previous inputs.
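The recurrence just described can be sketched minimally in pure Python. This is a toy scalar RNN with illustrative fixed weights, not the models used in the study; the bidirectional variant mirrors how BiGRU and BiLSTM read the sequence in both directions:

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    # One recurrence step: the new hidden state mixes the current input with
    # the previous hidden state, which acts as the network's internal memory.
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(seq):
    h, states = 0.0, []
    for x_t in seq:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

def run_birnn(seq):
    # Bidirectional variant (as in BiGRU/BiLSTM): one pass reads the sequence
    # forward, another reads it backward; states are paired per position.
    backward = run_rnn(seq[::-1])[::-1]
    return list(zip(run_rnn(seq), backward))

# The same input value yields different hidden states depending on history:
states = run_rnn([1.0, 0.0, 1.0])
```

Because the hidden state is fed back at every step, the state at the third position differs from the state at the first even though both inputs are 1.0; real GRU/LSTM cells add gating on top of this recurrence.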
As NLP research has progressed, many novel techniques have emerged; one showed that processing the input sequence bidirectionally can achieve a better understanding of the context [21]. A visualization of each model can be seen in Figure 2.

Figure 2: RNN vs. LSTM vs. GRU

Originally, RNNs suffered from the vanishing gradient problem, which occurs when training neural networks with gradient-based learning methods and backpropagation. The GRU (Gated Recurrent Unit) was introduced to overcome this problem; it uses an update gate and a reset gate so the model can store information longer and discard information irrelevant to the prediction. BiGRU is a model that uses two GRUs: one processes the input in the forward direction (the forward GRU), and the other processes it in the backward direction (the backward GRU) [21].

LSTM is an RNN enhancement capable of learning long-term dependencies [18]. This capability enables LSTMs to avoid the long-term dependency problem. An LSTM uses three gates to protect and control the cell state: the input gate, the forget gate, and the output gate. In this research, BiLSTM is used as the representative of LSTM. The flow of each model can be seen in detail in Figure 3.

Figure 3: LSTM / GRU vs. BiLSTM / BiGRU Architecture

2.2 Transformer Based Models

BART (Bidirectional and Auto-Regressive Transformer) is a language model pre-trained by applying noise or corruption to the input sequence and then tasking the model with reconstructing the original input sequence [33]. The model's predictions are scored against a loss function, generally cross-entropy, followed by gradient back-propagation and weight updates. A comparison between the RNN and transformer architectures can be seen in Figure 4.

Figure 4: RNN vs.
Transformer Architecture

The BART language model architecture combines an encoder (see Figure 5) as in BERT (Bidirectional Encoder Representations from Transformers) [37] and a decoder as in GPT (Generative Pre-Trained Transformer) [38], making it capable of performing both NLU and NLG tasks.

Figure 5: Encoder-decoder Illustration

mBART is a language model derived from BART that uses a denoising autoencoder objective and a sequence-to-sequence pre-trained model. The mBART model was trained once on a dataset of multiple languages and can be customized in a fine-tuning process [39].

The second model, T5 (Text-to-Text Transfer Transformer), is a pre-trained model that casts every NLP problem into a unified text-to-text format [40]. Once the model and its hyperparameters are configured, the same setup can be applied to other tasks.

The third model, GPT, leverages masked self-attention, in which future tokens are masked so the model only attends to the present and previous tokens. GPT works autoregressively, appending each generated token to the input sequence; that new sequence is then used as the input to the model at the next step.

3 Materials and Methods

The approach used in this research is transfer learning, in which a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task [40]. The research is divided into four main steps. The first step (planning phase) is to identify the problem. It is followed by the dataset preprocessing phase, in which the dataset is converted into a configured format so that it can be forwarded to train the language models. The third step is to train the models, which leverages several preconfigured language models such as BiGRU and BiLSTM; the pre-trained language models such as mBART and mT5 require further fine-tuning.
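The denoising pre-training described in Section 2.2 can be illustrated with a toy corruption step. The `corrupt` function and its 30% masking probability (echoing the T5 setup in Section 1.1) are illustrative simplifications; BART's actual noising also uses span infilling and sentence permutation:

```python
import random

def corrupt(tokens, mask_prob=0.3, seed=0):
    # Toy noising step: each token is independently replaced by [MASK] with
    # probability mask_prob. The pre-trained model is then trained to
    # reconstruct the original, uncorrupted sequence.
    rng = random.Random(seed)
    return [t if rng.random() >= mask_prob else "[MASK]" for t in tokens]

source = "samudra pasifik adalah kawasan kumpulan air terbesar".split()
noisy = corrupt(source)  # the model would learn to reconstruct `source`
```

Fine-tuning then reuses the weights learned from this reconstruction objective on the downstream question-generation data.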
3.1 Planning

In the first step, previous research on automatic question generation in Indonesian using BiGRU and BiLSTM was reviewed and evaluated. Both of those models are based on the recurrent neural network architecture. This research explores multilingual language models based on the transformer (encoder-decoder) architecture, such as mBART and mT5, and monolingual language models such as IndoNLG's IndoBART and IndoGPT [4].

3.2 Dataset Preprocessing

The dataset used is the preprocessed SQuAD dataset translated into Indonesian, shared by all the models, resulting in 102,657 training examples, 11,407 validation examples, and 10,567 test examples for SQuAD, plus 550 test examples for TyDiQA. For the sequence-to-sequence language models, special tokens were first added to the preprocessed SQuAD dataset.

Since some of the translated Indonesian SQuAD question-answer pairs may stray from their true meaning, because the passage and the context were translated separately, some corrections needed to be made. The correction process leveraged fuzzy string matching to thoroughly search the translated question-answer pairs for inconsistent translations. When the answer is found in the passage, its start position is updated; when it is not found, the start position is set to negative one (-1) and the pair is removed.

In this step, we preprocessed the enhanced SQuAD dataset from previous research [36] by reusing the main dataset attributes (context/passage, question, and answer) used by the models for training, and excluding linguistic features such as the part-of-speech (POS) and named entity (NE) attributes, which are used only by the BiGRU, BiLSTM, and Transformer models.
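The answer-alignment step can be sketched with Python's standard `difflib`; `locate_answer` and its threshold are hypothetical stand-ins for the fuzzy string matching the study used:

```python
from difflib import SequenceMatcher

def locate_answer(context, answer, threshold=0.8):
    """Find the best-matching start position of a translated answer inside
    the translated context. Returns -1 when no sufficiently similar span
    exists, mirroring the rule that unfound answers get start position -1
    and are removed."""
    n = len(answer)
    best_pos, best_ratio = -1, 0.0
    # Brute-force scan over all windows of the answer's length (a sketch;
    # a real pipeline would use a faster fuzzy-matching routine).
    for i in range(len(context) - n + 1):
        ratio = SequenceMatcher(None, context[i:i + n], answer).ratio()
        if ratio > best_ratio:
            best_pos, best_ratio = i, ratio
    return best_pos if best_ratio >= threshold else -1

pos = locate_answer("Samudra Pasifik adalah kawasan terbesar", "Pasifik")
# exact match found at character index 8
```

Because the matching is fuzzy rather than exact, an answer whose translation differs slightly from its occurrence in the translated passage can still be aligned, while genuinely inconsistent pairs fall below the threshold and are dropped.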
For mBART, the encoder and decoder inputs are each built from the answer, the context, and special tokens in a fixed template; mT5 uses analogous encoder and decoder templates. Since GPT is a decoder-only transformer language model, the answer, context, and question are concatenated into a single input sequence. The flow can be seen in Figure 6.

Figure 6: Process to Repair Answer Translation Result from Previous Research

3.3 Training Model

Using the formatted dataset, the BiGRU and BiLSTM models were built by applying the configuration from the sequence-to-sequence learning method for the Indonesian Automatic Question Generator. For mBART and mT5, new fine-tuned models were made [41]. The models were trained alternately to ensure that each used the same computing resources.

3.4 Evaluation

The results from each algorithm are evaluated using the BLEU and ROUGE metrics. Comparison and analysis are then conducted on these results to choose the best of all the implemented algorithms. The overall flow of benchmarking the models can be seen in Figure 7.

4 Results & Discussion

Tokenization is a way to separate a piece of text into smaller units known as tokens, which can be words, characters, or subwords. To fine-tune the mBART pre-trained language model, the sequence forwarded to the model is first appended with special tokens, such as a language id token for the multilingual model to identify
Table 1: Related Research Rerun Result on SQuAD Test Set

Model        Dataset            BLEU 1  BLEU 2  BLEU 3  BLEU 4  ROUGE L  Epoch
BiGRU        Cased               33.87   17.01    8.42    3.88    37.98     20
BiGRU        Cased-Copy          36.03   19.37    9.57    5.41    40.96     20
BiGRU        Cased-Copy-Cov      36.16   18.79   10.99    6.52    40.49     20
BiGRU        Uncased             36.96   17.23    8.46    5.11    40.08     20
BiGRU        Uncased-Copy        39.62   22.26   12.02    5.88    43.38     20
BiGRU        Uncased-Copy-Cov    39.56   21.99   11.34    5.99    43.41     20
BiLSTM       Cased               32.16   14.61    7.73    3.73    38.00     10
BiLSTM       Cased-Copy          36.67   19.28   10.85    5.29    40.67     10
BiLSTM       Cased-Copy-Cov      35.86   18.69    9.21    7.07    40.27     10
BiLSTM       Uncased             35.45   18.19    8.87    4.63    39.48     10
BiLSTM       Uncased-Copy        40.60   21.35   10.93    5.73    43.79     10
BiLSTM       Uncased-Copy-Cov    39.90   22.23   12.49    5.98    43.34     10
Transformer  Cased               30.72   12.63    4.44    2.46    34.25    300
Transformer  Cased-Copy          36.14   18.81    9.52    4.75    39.58    300
Transformer  Uncased             33.34   13.58    5.86    3.38    37.71    300
Transformer  Uncased-Copy        39.09   21.21   10.83    5.39    43.69    300

Figure 7: Models Benchmark creation process diagram.

the language, a beginning-of-sentence token, an end-of-sentence token, and a separator token placed between the answer and the question, which lets the model identify the label and the context. The tokenizer then tokenizes the rest of the passage into token representations so that the model can understand the context. Unlike mBART, fine-tuning mT5 does not require appending a language id token to the input sequence to help the model identify the language. Notably, when the special tokens are omitted, the language model cannot parse the input sequence correctly and therefore performs poorly.

Tables 3 and 4 show samples of the questions generated in Indonesian by each model on the SQuAD and TyDiQA datasets. "Input Sentence & Answer" is the context or passage given as model input followed by the expected answer, and "Target Question" is the expected generated question.
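Assembling an input sequence from these special tokens might look as follows; the token spellings and their ordering in `format_qg_input` are illustrative assumptions, since the text only describes the tokens' roles, not their exact template:

```python
def format_qg_input(answer, context, lang_id="[id_ID]",
                    bos="<s>", eos="</s>", sep="<sep>"):
    # Hypothetical template: a language id token (used by mBART, not mT5),
    # begin/end-of-sentence tokens, and a separator between the answer and
    # the context. The exact order and spellings here are assumptions.
    return f"{lang_id} {bos} {answer} {sep} {context} {eos}"

encoder_input = format_qg_input(
    "179,7 juta km2",
    "Samudra Pasifik adalah kawasan kumpulan air terbesar di dunia ...")
```

The separator is what allows the model to distinguish the answer span from the surrounding passage; dropping any of these tokens corresponds to the poorly performing "no special tokens" setting described above.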
RNN-based models evaluated on the TyDiQA dataset perform lower than on the SQuAD dataset, since most of the SQuAD text is translated and contains many faulty translations [18], while TyDiQA is originally in Indonesian. The same applies to the transformer-based models, including those based on pre-trained multilingual and monolingual models. It can also be seen in the pre-trained models' rows that the maximum ROUGE and BLEU scores on the SQuAD test set are up to 10 points higher than on TyDiQA. The evaluation on the TyDiQA test set is done to obtain a more reliable and comparable evaluation score, since TyDiQA is a more natural Indonesian dataset [18].

For the RNN-based and from-scratch transformer results, TyDiQA and SQuAD do not show a significant difference in scores, but the two differ significantly for the pre-trained models, especially the monolingual models IndoBART and IndoGPT. With these monolingual models, TyDiQA, which is natively available in Indonesian while SQuAD is mostly translation, shows that monolingual models can perform better on datasets in the same language.

Table 2: Related Research Rerun Result on TyDiQA Test Set

Model        Dataset            BLEU 1  BLEU 2  BLEU 3  BLEU 4  ROUGE L  Epoch
BiGRU        Cased               30.33   10.83    3.18    1.89    34.41     20
BiGRU        Cased-Copy          34.05   15.01    7.47    3.12    38.15     20
BiGRU        Cased-Copy-Cov      34.28   14.72    6.60    2.63    38.30     20
BiGRU        Uncased             33.28   13.92    5.65    2.98    37.45     20
BiGRU        Uncased-Copy        37.42   17.96    8.73    4.65    41.71     20
BiGRU        Uncased-Copy-Cov    37.78   18.68    9.15    5.46    41.92     20
BiLSTM       Cased               30.74   12.00    4.09    1.48    34.74     10
BiLSTM       Cased-Copy          34.62   14.80    6.49    3.62    38.66     10
BiLSTM       Cased-Copy-Cov      34.13   14.57    6.40    2.84    38.17     10
BiLSTM       Uncased             32.92   13.02    4.81    2.28    36.98     10
BiLSTM       Uncased-Copy        37.63   18.38    8.73    4.62    42.12     10
BiLSTM       Uncased-Copy-Cov    38.14   18.54    8.90    4.55    42.59     10
Transformer  Cased               27.88    8.00    0.71    0.64    31.86    300
Transformer  Cased-Copy          31.95   12.43    4.63    2.27    36.37    300
Transformer  Uncased             29.39    8.62    1.12    0.52    33.27    300
Transformer  Uncased-Copy        37.23   17.93    8.19    3.40    41.90    300

Note that the fine-tuned pre-trained models (mBART, IndoBART, IndoGPT, and mT5) do not use the linguistic features such as POS and NE provided in the preprocessed dataset, yet they outperform the RNN-based and from-scratch models. Even without this extra context, the transformer-based pre-trained models demonstrate that transfer learning gives models a better understanding than models without any base knowledge.

Generally, monolingual language models have fewer parameters than multilingual language models, resulting in faster training and smaller model size. In this research, the monolingual language models were pre-trained on a large monolingual (Indonesian) corpus, whereas the multilingual language models were pre-trained on a large multilingual corpus, hence the term multilingual. In terms of performance, the two only differ slightly [4].

Furthermore, numerous incorrect and unnatural translations, especially in the SQuAD dataset's input sentences and target questions, impact the model predictions. Nonetheless, those sentences were still semantically understandable.
All models agreed on the same projected questions, resulting in highly similar outputs. There were some variations in the verbs of the generated questions, but they are all synonyms with similar meanings.

The mT5 model has the lowest automatic evaluation score among the pre-trained models. However, read directly, the mT5 predictions seem to be the most fluent. The mT5 encoders, which may contribute to this, are based on the BERT language model, known for its bidirectional approach that captures the context of the input sequence more deeply [37]. These encoders also take into account the relations between words, which helps capture meaning [42]; each consists of a self-attention layer and a feedforward network that process the sequence for the decoders. The decoders are similar to those of the originally proposed transformer language model [43], leveraging the auto-regressive approach to produce the output sequence. mT5 was pre-trained on a large multilingual corpus covering over 100 languages [41].

The hyperparameters used for each model are listed in Table 5. The maximum sequence length hyperparameter was set to 512 for every fine-tuned language model. Table 5 shows the configuration used in each model

Table 3: All Models AQG Task Sample Predictions 1 - SQuAD

Input Sentence & Answer: Samudra Pasifik atau Lautan Teduh (dari bahasa spanyol Pacifico, artinya tenang) adalah kawasan kumpulan air terbesar di dunia, serta mencakup kira-kira sepertiga permukaan Bumi, dengan luas sebesar 179,7 juta km2 (69,4 juta mi2). Panjangnya sekitar 15.500 km (9.600 mi) dari Laut Bering di Arktik hingga batasan es di Laut Ross di Antartika di selatan.
Samudra Pasifik mencapai lebar timur-barat terbesarnya pada sekitar 5 derajat U garis lintang, di mana ia terbentang sekitar 19.800 km (12.300 mi) dari Indonesia hingga pesisir Kolombia. Batas sebelah barat samudra ini biasanya diletakkan di Selat Malaka. Titik terendah permukaan Bumi—Palung Mariana—berada di Samudra Pasifik. Samudra ini terletak di antara Asia dan Australia di sebelah barat, Amerika di sebelah timur, Antartika di sebelah selatan dan Samudra Arktik di sebelah utara.
Answer: 179,7 juta km2
Target Question: Berapa luas Samudera Pasifik?
BiGRU Uncased-Cop-Cov: berapa luas samudra pasifik ?
BiLSTM Uncased-Copy-Cov: berapa luas bumi pasifik ?
Transformer Uncased-Copy: berapa luas air terbesar di dunia ?
mBART-Large: berapakah luas samudra pasifik?
IndoBART: berapakah luas samudra pasifik?
IndoGPT: berapa luas total wilayah lautan pasifik ?
mT5-Small: Berapa luas samudra pasifik?

to generate the sentences and the training time needed for the SQuAD and TyDiQA datasets. The training step and validation step values for the mBART-L, IndoBART, IndoGPT, and mT5-Base pre-trained models are not listed because they are not explicitly defined in this modeling.

The fine-tuned mBART performed best on the SQuAD dataset, with an average BLEU of 31.71 and a ROUGE-L score of 46.27 (Table 6) for the Indonesian question generation task. The fine-tuned IndoBART performed best on the TyDiQA dataset, with an average BLEU of 17.26 and a ROUGE-L score of 33.73 (Table 7). On the other hand, the RNN-based and from-scratch transformer results on the TyDiQA and SQuAD datasets do not show a significant difference in scores, but they differ significantly from the pre-trained models. With these monolingual models, TyDiQA, which is natively Indonesian while SQuAD is mostly translation, shows that monolingual models can perform better on datasets in the same language.
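Reading "Average BLEU" in Tables 6 and 7 as the arithmetic mean of BLEU-1 through BLEU-4 reproduces the reported values (an assumption consistent with every row); a minimal sketch:

```python
def average_bleu(bleu_scores):
    # "Average BLEU" taken as the arithmetic mean of BLEU-1..BLEU-4,
    # rounded to two decimals as in the tables.
    return round(sum(bleu_scores) / len(bleu_scores), 2)

# mBART-L on SQuAD (Table 6): BLEU-1..4 of 53.58, 32.41, 23.25, 17.59
mbart_avg = average_bleu([53.58, 32.41, 23.25, 17.59])  # 31.71, as reported
```

The same computation on the IndoBART TyDiQA row (38.65, 16.01, 8.95, 5.43) yields 17.26, matching Table 7.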
5 Conclusions

Based on the results achieved in this research, language models based on the transformer architecture, which leverage self-attention mechanisms, achieved state-of-the-art results in generating questions compared to language models based on bidirectional recurrent neural network architectures such as BiLSTM and BiGRU.

This research introduces a more extensive comparison between RNN-based and transformer-based models, including state-of-the-art variants, on the Indonesian AQG task. Previous research had already proven that an Indonesian AQG system can be built using an as-is machine-translated question answering dataset (SQuAD v2.0) with acceptable results; this research shows that better performance can be achieved with different varieties of

Table 4: All Models AQG Task Sample Predictions 2 - TyDiQA

Input Sentence & Answer: Kadipaten Normandia, yang mereka bentuk dengan perjanjian dengan mahkota Prancis, adalah tanah yang indah bagi Prancis abad pertengahan, dan di bawah Richard I dari Normandia ditempa menjadi sebuah pemerintahan yang kohesif dan tangguh dalam masa jabatan feodal.
Answer: Kadipaten Normandia
Target Question: Siapa yang memerintah kadipaten Normandia
BiGRU Uncased-Cop-Cov: siapa yang memerintah pemerintahan normandia ?
BiLSTM Uncased-Copy-Cov: siapa yang mendirikan kadipaten normandia ?
Transformer Uncased-Copy: siapa yang memerintah normandia di normandia ?
mBART-Large: siapakah kadipaten normandia di bawah raja normandia ?
IndoBART: siapa yang memimpin normandia ?
IndoGPT: dengan siapa prancis membentuk kadipaten normandia ?
mT5-Small: Siapa yang memerintahkan kadipaten normandia?
Table 5: Model Configuration

Model        Dataset           Learning Rate  Training Step  Valid Step  Epoch  Batch Size  Training Time
BiGRU        Uncased-Cop-Cov   1.00E-03       32,100         3,210       20     64          55m
BiLSTM       Uncased-Cop-Cov   1.00E-03       17,655         3,210       10     64          1h13m
Transformer  Uncased-Cop       1.00E+00       120,600        4,020       300    256         5h40m
mBART-L      Uncased-Large     1.00E-03       -              -           20     8           40h42m
IndoBART     Uncased-v2        1.00E-03       -              -           20     64          7h26m
IndoGPT      Uncased           1.00E-03       -              -           20     32          10h36m
mT5-Base     Uncased-Small     3.00E-05       -              -           3      4           8h4m

transformer-based models such as mBART and mT5, as well as monolingual models built on an Indonesian dataset, namely IndoBART.

5.1 RNN-based vs Transformer-based

Transformer-based models outperformed all the RNN-based models. As seen in Table 6 and Table 7, transformer-based models perform better at generating natural Indonesian questions on the TyDiQA dataset, which contains 550 Indonesian question-answer pairs.

5.1.1 Monolingual vs. Multilingual

Since monolingual language models were pre-trained on a monolingual dataset, they have fewer parameters and hence train faster than multilingual language models. In terms of performance, monolingual and multilingual language models do not differ very much.

5.2 Future Improvements

An Indonesian AQG system can achieve better results with a more natural, labeled Indonesian QA or AQ dataset. This should be followed by more robust data preprocessing to avoid syntactically incorrect data and biases. Experiments on more
Table 6: Model Evaluation Metric Performance Comparison on SQuAD Test Set

Model        BLEU 1  BLEU 2  BLEU 3  BLEU 4  Average BLEU  ROUGE L  Epoch
BiGRU         39.56   21.99   11.34    5.99      19.72       43.41     20
BiLSTM        39.90   22.23   12.49    5.98      20.08       43.34     10
Transformer   39.09   21.21   10.83    5.39      19.13       43.69    300
mBART-L       53.58   32.41   23.25   17.59      31.71       44.70     20
IndoBART      55.03   31.88   22.27   16.42      31.40       46.27     20
IndoGPT       54.07   30.56   21.21   15.72      30.39       44.31     20
mT5-Base      41.13   14.92    7.16    3.86      16.77       40.51      3

Table 7: Model Evaluation Metric Performance Comparison on TyDiQA Test Set

Model        BLEU 1  BLEU 2  BLEU 3  BLEU 4  Average BLEU  ROUGE L  Epoch
BiGRU         37.78   18.68    9.15    5.46      17.77       41.92     20
BiLSTM        38.14   18.54    8.90    4.55      17.53       42.59     10
Transformer   37.23   17.93    8.19    3.40      16.69       41.90    300
mBART-L       36.85   15.96    9.56    6.05      17.10       32.64     20
IndoBART      38.65   16.01    8.95    5.43      17.26       33.73     20
IndoGPT       35.77   12.55    6.55    3.78      14.66       28.93     20
mT5-Base      32.23    7.98    2.39    0.92      10.88       36.10      3

precise hyperparameters can also help in obtaining the best-performing models.

Future work concerns a deeper analysis of particular mechanisms and proposals to explore different techniques. Many other language models, varying in parameter count, can be explored for automatic question generation tasks. Various hyperparameter configurations can be optimized through hyperparameter tuning to obtain the best fine-tuning results. Leveraging different evaluation metrics can produce more comprehensive results that show the model's capabilities in the bigger picture. It is also worth mentioning that the enhanced SQuAD dataset from previous research still has much room for improvement.

References

[1] G. Kurdi, J. Leo, B. Parsia, U. Sattler, and S. Al-Emari, "A systematic review of automatic question generation for educational purposes," International Journal of Artificial Intelligence in Education, vol. 30, Nov. 2019. [Online]. Available: https://doi.org/10.1007/s40593-019-00186-y

[2] N.-T. Le, T.
Kojiri, and N. Pinkwart, "Automatic question generation for educational applications – the state of art," Advances in Intelligent Systems and Computing, vol. 282, pp. 325–338, 01 2014. [Online]. Available: https://doi.org/10.1007/978-3-319-06569-4_24

[3] J. Jamiluddin and V. Ramadayanti, "Developing Students' Reading Comprehension Through Question Generation Strategy," e-Journal of ELTS (English Language Teaching Society), vol. 8, no. 1, Apr. 2020.

[4] S. Cahyawijaya, G. I. Winata, B. Wilie, K. Vincentio, X. Li, A. Kuncoro, S. Ruder, Z. Y. Lim, S. Bahar, M. Khodra, A. Purwarianti, and P. Fung, "IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation," pp. 8875–8898, Nov. 2021. [Online]. Available: https://doi.org/10.18653/v1/2021.emnlp-main.699

[5] C. A. Nwafor and I. E. Onyenwe, "An automated multiple-choice question generation using natural language processing techniques," International Journal on Natural Language Computing, vol. 10, no. 02, pp. 1–10, Apr. 2021. [Online]. Available: http://dx.doi.org/10.5121/ijnlc.2021.10201

[6] A. D. Lelkes, V. Q. Tran, and C. Yu, "Quiz-style question generation for news stories," New York, NY, USA, pp. 2501–2511, 2021. [Online]. Available: https://doi.org/10.1145/3442381.3449892

[7] A. Graesser, V. Rus, S. D'Mello, and G. Jackson, "AutoTutor: Learning through natural language dialogue that adapts to the cognitive and affective states of the learner," Current Perspectives on Cognition, Learning and Instruction: Recent Innovations in Educational Technology that Facilitate Student Learning, pp. 95–125, 01 2008.

[8] N.-T. Le, T. Kojiri, and N. Pinkwart, "Automatic question generation for educational applications – the state of art," Advances in Intelligent Systems and Computing, vol. 282, pp. 325–338, 01 2014. [Online]. Available: https://doi.org/10.1007/978-3-319-06569-4_24

[9] J. H. Wolfe, "Automatic question generation from text – an aid to independent study," in SIGCSE '76, 1976. [Online]. Available: https://doi.org/10.1145/952989.803459

[10] W. Yuan, T. He, and X. Dai, "Improving Neural Question Generation using Deep Linguistic Representation," in Proceedings of the Web Conference 2021. Ljubljana, Slovenia: ACM, Apr. 2021, pp. 3489–3500. [Online]. Available: https://doi.org/10.1145/3442381.3449975

[11] K. Vachev, M. Hardalov, G. Karadzhov, G. Georgiev, I. Koychev, and P. Nakov, "Leaf: Multiple-choice question generation," CoRR, vol. abs/2201.09012, 2022. [Online]. Available: https://doi.org/10.1007/978-3-030-99739-7_41

[12] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy, "RACE: Large-scale reading comprehension dataset from examinations," CoRR, vol. abs/1704.04683, 2017. [Online]. Available: https://doi.org/10.18653/v1/d17-1082

[13] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, "Can a suit of armor conduct electricity? A new dataset for open book question answering," CoRR, vol. abs/1809.02789, 2018. [Online]. Available: https://doi.org/10.18653/v1/d18-1260

[14] P. Clark, O. Etzioni, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, N. Tandon, S. Bhakthavatsalam, D. Groeneveld, M. Guerquin, and M. Schmitz, "From 'F' to 'A' on the N.Y. Regents Science Exams: An overview of the Aristo project," CoRR, vol. abs/1909.01958, 2019. [Online]. Available: https://doi.org/10.1609/aimag.v41i4.5304

[15] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge," CoRR, vol. abs/1803.05457, 2018. [Online]. Available: https://doi.org/10.1609/aaai.v33i01.33017063

[16] O. Tafjord, P. Clark, M. Gardner, W. Yih, and A. Sabharwal, "QuaRel: A dataset and models for answering questions about qualitative relationships," CoRR, vol. abs/1811.08048, 2018.

[17] F. C. Akyon, D. Cavusoglu, C. Cengiz, S. O. Altinuc, and A. Temizel, "Automated question generation and question answering from Turkish texts using text-to-text transformers," arXiv:2111.06476 [cs], Nov. 2021. [Online]. Available: https://doi.org/10.55730/1300-0632.3914

[18] F. J. Muis and A. Purwarianti, "Sequence-to-sequence learning for Indonesian automatic question generator," 2020 7th International Conference on Advanced Informatics: Concepts, Theory and Applications, ICAICTA 2020, 9 2020. [Online]. Available: https://doi.org/10.1109/ICAICTA49861.2020.9429032

[19] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2383–2392. [Online]. Available: https://doi.org/10.18653/v1/D16-1264

[20] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki, "TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages," Transactions of the Association for Computational Linguistics, vol. 8, pp. 454–470, 2020. [Online]. Available: https://doi.org/10.1162/tacl_a_00317

[21] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997. [Online]. Available: https://doi.org/10.1109/78.650093

[22] K. Kriangchaivech and A. Wangperawong, "Question Generation by Transformers," arXiv:1909.05017 [cs], Sep. 2019.

[23] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, "XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization," CoRR, vol. abs/2003.11080, 2020.

[24] P. Colombo, N. Noiry, E. Irurozki, and S. Clémençon, "What are the best systems? New perspectives on NLP benchmarking," CoRR, vol. abs/2202.03799, 2022.

[25] B. Wilie, K. Vincentio, G. I. Winata, S. Cahyawijaya, X. Li, Z. Y. Lim, S. Soleman, R. Mahendra, P. Fung, S. Bahar, and A. Purwarianti, "IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding," in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Dec. 2020, pp. 843–857. [Online]. Available: https://doi.org/10.18653/v1/2021.emnlp-main.699

[26] W. Etaiwi, D. Suleiman, and A. Awajan, "Deep Learning Based Techniques for Sentiment Analysis: A Survey," Informatica, vol. 45, no. 7, Dec. 2021. [Online]. Available: https://doi.org/10.31449/inf.v45i7.3674

[27] A. C. Mazari and A. Djeffal, "Sentiment Analysis of Algerian Dialect Using Machine Learning and Deep Learning with Word2vec," Informatica, vol. 46, no. 6, Jul. 2022. [Online]. Available: https://doi.org/10.31449/inf.v46i6.3340

[28] S. T. Al-Otaibi and A. A. Al-Rasheed, "A Review and Comparative Analysis of Sentiment Analysis Techniques," Informatica, vol. 46, no. 6, Jul. 2022. [Online]. Available: https://doi.org/10.31449/inf.v46i6.3991

[29] D. Suleiman, A. Odeh, and R. Al-Sayyed, "Arabic Sentiment Analysis Using Naïve Bayes and CNN-LSTM," Informatica, vol. 46, no. 6, Jul. 2022. [Online]. Available: https://doi.org/10.31449/inf.v46i6.4199

[30] A. A. Al-Rasheed, "Finding Influential Users in Social Networking using Sentiment Analysis," Informatica, vol. 46, no. 5, Mar. 2022. [Online]. Available: https://doi.org/10.31449/inf.v46i5.3829

[31] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 353–355. [Online]. Available: https://doi.org/10.18653/v1/W18-5446

[32] S. Gehrmann, T. Adewumi, K. Aggarwal, P. S. Ammanamanchi, A. Aremu, A. Bosselut, K. R. Chandu, M.-A. Clinciu, D. Das, K. Dhole, W. Du, E. Durmus, O. Dušek, C. C. Emezue, V. Gangal, C. Garbacea, T. Hashimoto, Y. Hou, Y. Jernite, H. Jhamtani, Y. Ji, S. Jolly, M. Kale, D. Kumar, F. Ladhak, A. Madaan, M. Maddela, K. Mahajan, S. Mahamood, B. P. Majumder, P. H. Martins, A. McMillan-Major, S. Mille, E. van Miltenburg, M. Nadeem, S. Narayan, V. Nikolaev, A. Niyongabo Rubungo, S. Osei, A. Parikh, L. Perez-Beltrachini, N. R. Rao, V. Raunak, J. D. Rodriguez, S. Santhanam, J. Sedoc, T. Sellam, S. Shaikh, A. Shimorina, M. A. Sobrevilla Cabezudo, H. Strobelt, N. Subramani, W. Xu, D. Yang, A. Yerukola, and J. Zhou, "The GEM benchmark: Natural language generation, its evaluation and metrics," in Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021). Online: Association for Computational Linguistics, Aug. 2021, pp. 96–120. [Online]. Available: https://doi.org/10.18653/v1/2021.gem-1.10

[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 7871–7880. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.703

[34] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, "Transformers: State-of-the-art natural language processing," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://doi.org/10.18653/v1/2020.emnlp-demos.6

[35] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, "Multilingual Denoising Pre-training for Neural Machine Translation," Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 11 2020. [Online]. Available: https://doi.org/10.1162/tacl_a_00343

[36] P. Rajpurkar, R. Jia, and P. Liang, "Know what you don't know: Unanswerable questions for SQuAD," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 784–789. [Online]. Available: https://doi.org/10.18653/v1/P18-2124

[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://doi.org/10.18653/v1/N19-1423

[38] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., Jul. 2020, pp. 1877–1901, arXiv: 2005.14165.

[39] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, and A. Fan, "Multilingual translation from denoising pre-training," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, Aug. 2021, pp. 3450–3466. [Online]. Available: https://doi.org/10.18653/v1/2021.findings-acl.304

[40] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.

[41] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, "mT5: A massively multilingual pre-trained text-to-text transformer," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, Jun. 2021, pp. 483–498. [Online]. Available: https://doi.org/10.18653/v1/2021.naacl-main.41

[42] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. [Online]. Available: https://doi.org/10.18653/v1/N18-1202

[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS'17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010.

Appendix A

Table 8: Translated texts from Indonesian to English for all Indonesian texts mentioned above.

1. Indonesian: Samudra Pasifik atau Lautan Teduh (dari bahasa Spanyol Pacifico, artinya tenang) adalah kawasan kumpulan air terbesar di dunia, serta mencakup kira-kira sepertiga permukaan Bumi, dengan luas sebesar 179,7 juta km2 (69,4 juta mi2). Panjangnya sekitar 15.500 km (9.600 mi) dari Laut Bering di Arktik hingga batasan es di Laut Ross di Antartika di selatan. Samudra Pasifik mencapai lebar timur-barat terbesarnya pada sekitar 5 derajat U garis lintang, di mana ia terbentang sekitar 19.800 km (12.300 mi) dari Indonesia hingga pesisir Kolombia. Batas sebelah barat samudra ini biasanya diletakkan di Selat Malaka. Titik terendah permukaan Bumi—Palung Mariana—berada di Samudra Pasifik. Samudra ini terletak di antara Asia dan Australia di sebelah barat, Amerika di sebelah timur, Antartika di sebelah selatan dan Samudra Arktik di sebelah utara.
   English: The Pacific Ocean or Ocean of Shades (from the Spanish Pacifico, meaning calm) is the largest area of water body in the world, and covers about a third of the Earth's surface, with an area of 179.7 million km2 (69.4 million mi2).
It extends about 15,500 km (9,600 mi) from the Bering Sea in the Arctic to the ice cap of the Ross Sea in Antarctica in the south. The Pacific Ocean reaches its greatest east-west width at about 5 degrees N latitude, where it extends about 19,800 km (12,300 mi) from Indonesia to the coast of Colombia. The western boundary of this ocean is usually placed in the Malacca Strait. The lowest point on Earth's surface—the Mariana Trench—is in the Pacific Ocean. This ocean is located between Asia and Australia to the west, America to the east, Antarctica to the south and the Arctic Ocean to the north.

2. Indonesian: 179,7 juta km2
   English: 179.7 million km2

3. Indonesian: Berapa luas Samudera Pasifik?
   English: How wide is the Pacific Ocean?

4. Indonesian: berapa luas samudra pasifik ?
   English: how wide is the pacific ocean?

5. Indonesian: berapa luas bumi pasifik ?
   English: how big is the pacific earth?

6. Indonesian: berapa luas air terbesar di dunia ?
   English: what is the largest area of water in the world?

7. Indonesian: berapakah luas samudra pasifik?
   English: how wide is the pacific ocean?

8. Indonesian: berapakah luas samudra pasifik?
   English: how wide is the pacific ocean?

9. Indonesian: berapa luas total wilayah lautan pasifik ?
   English: What is the total area of the Pacific Ocean?

10. Indonesian: Berapa luas samudra pasifik?
    English: How wide is the Pacific Ocean?

11. Indonesian: Kadipaten Normandia, yang mereka bentuk dengan perjanjian dengan mahkota Prancis, adalah tanah yang indah bagi Prancis abad pertengahan, dan di bawah Richard I dari Normandia ditempa menjadi sebuah pemerintahan yang kohesif dan tangguh dalam masa jabatan feodal.
    English: The Duchy of Normandy, which they formed by treaty with the French crown, was a beautiful land for medieval France, and under Richard I of Normandy was forged into a cohesive and formidable government in feudal tenure.

12. Indonesian: Kadipaten Normandia
    English: Duchy of Normandy

13. Indonesian: Siapa yang memerintah kadipaten Normandia
    English: Who ruled the duchy of Normandy

14. Indonesian: siapa yang memerintah pemerintahan normandia ?
    English: who governs the normandy government?

15. Indonesian: siapa yang mendirikan kadipaten normandia ?
    English: who founded the duchy of normandy?
Appendix B

Table 9: Translated texts from Indonesian to English for all Indonesian texts mentioned above.

16. Indonesian: siapa yang memerintah normandia di normandia ?
    English: who rules normandy in normandy?

17. Indonesian: siapakah kadipaten normandia di bawah raja normandia ?
    English: who is the duchy of normandy under the king of normandy?

18. Indonesian: siapa yang memimpin normandia ?
    English: who is in charge of normandy?

19. Indonesian: dengan siapa prancis membentuk kadipaten normandia ?
    English: With whom did France form the Duchy of Normandy?

20. Indonesian: Siapa yang memerintahkan kadipaten normandia?
    English: Who ruled the duchy of normandy?
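The BLEU-n and ROUGE-L scores reported in Tables 6 and 7 can be illustrated with a minimal, self-contained sketch. This is a simplification for illustration only: BLEU-n is computed here as a single clipped n-gram precision with a brevity penalty, and ROUGE-L as an LCS-based F1 over whitespace tokens; published evaluations typically use standard toolkits (e.g., cumulative BLEU with smoothing), so exact numbers may differ. The example question pair below is hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_n(candidate, reference, n):
    """Clipped n-gram precision with a brevity penalty (illustrative;
    not the cumulative geometric-mean BLEU of standard toolkits)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = ngram_counts(cand, n)
    ref_counts = ngram_counts(ref, n)
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    precision = overlap / max(sum(cand_counts.values()), 1)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision


def rouge_l(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of the token lists."""
    cand, ref = candidate.split(), reference.split()
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


# A generated question scored against a gold question (hypothetical pair,
# in the style of the Appendix A examples).
gold = "berapa luas samudra pasifik ?"
generated = "berapakah luas samudra pasifik ?"
print(round(bleu_n(generated, gold, 1), 2))  # BLEU-1 (unigram precision)
print(round(rouge_l(generated, gold), 2))    # ROUGE-L F1
```

Scores of this kind, averaged over a test set and scaled by 100, give numbers on the same footing as the tables above, though toolkit details (tokenization, smoothing, cumulative n-gram weighting) shift the absolute values.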