101
Learning languages from parallel 
corpora: a blueprint for turning 
corpus examples into language 
learning exercises 
Johannes GRAËN
Institute of Computational Linguistics, University of Zurich & 
Department of Swedish, University of Gothenburg
This work describes a blueprint for an application that generates language 
learning exercises from parallel corpora. Word alignment and parallel structures 
allow for the automatic assessment of sentence pairs in the source and target 
languages, while users of the application continuously improve the quality of the 
data with their interactions, thus crowdsourcing parallel language learning ma-
terial. Through triangulation, their assessment can be transferred to language 
pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, 
and we will discuss three of them here. First, the question of how adequate learn-
ing material can be identified in corpora has received some attention in the last 
decade, and we will detail what the structure of parallel corpora implies for that 
selection. Secondly, we will consider which type of exercises can be generated 
automatically from parallel corpora such that they foster learning and keep learn-
ers motivated. And thirdly, we will highlight the potential of employing users, that 
is both teachers and learners, as crowdsourcers to help improve the material.
Keywords: ICALL, language learning exercises, parallel corpora, data-driven 
learning
Graën, J.: Learning languages from parallel corpora: a blueprint for turning corpus 
examples into language learning exercises. Slovenščina 2.0, 10(2): 101–131. 
1.01 Izvirni znanstveni članek / Original Scientific Article
DOI: https://doi.org/10.4312/slo2.0.2022.2.101-131
https://creativecommons.org/licenses/by-sa/4.0/
102
Slovenščina 2.0, 2022 (2) | Articles
1 Overview
The generation of language learning exercises based on parallel corpus 
material requires the combination of several techniques and strategies. 
First of all, in order to automatically assess corpus material regarding 
its suitability for language learning exercises, we need to annotate it us-
ing standard techniques of Natural Language Processing (NLP), such as 
tokenization, lemmatization, part-of-speech tagging, and named entity 
recognition. In addition, we want to annotate the vocabulary used in 
those examples with the lowest proficiency level required to compre-
hend single lexical items of the target language that the learners want 
to acquire. The use of NLP techniques for computer-assisted language 
learning (CALL) is commonly referred to as ICALL (intelligent CALL) due 
to the numerous components of artificial intelligence (AI) that are ap-
plied in NLP methods (Lu, 2018).
Concerning parallel corpora (Section 2), we can take advantage of 
the expected parallelism between individual corpus units in the target 
language and the native language (L1) of the learner, or another foreign 
language (L2) in which the learner is sufficiently proficient. The lat-
ter case might be advantageous if there is a close typological relation 
between the target language and the L2. Take, for instance, a Finnish 
learner of Portuguese, who is already an advanced learner of Italian. In 
that case, examples from a parallel corpus of Portuguese/Italian will 
likely have more similarities regarding vocabulary and structure than a 
parallel corpus of Portuguese/Finnish.
The adequacy of the corpus material in particular sentences for dif-
ferent learner proficiency levels has received considerable attention in 
recent years (Pilán, 2018; Tack, 2021). A multitude of factors determine 
whether learners of a particular proficiency level are likely to compre-
hend a sentence or not. In the case of parallel sentence pairs, we will not 
only estimate the required proficiency level for each of the sentences 
individually, but also take into account the way it has been translated, 
independent of the translation direction. Employing interlingual word-
level correspondences and intralingual syntactic relations between sin-
gle words, we will derive grammatical correspondences, which, in turn, 
can be classified in terms of proficiency levels (Section 3).
103
Learning languages from parallel corpora: a blueprint for turning corpus examples...
Data-driven learning (Section 4) is a well-explored technique sup-
porting language learner autonomy. The main idea is to let learners ex-
plore authentic language material on their own, which will make them 
observe patterns, turn those into hypotheses and then corroborate 
these with the help of search tools. Those patterns can relate to any 
linguistic level, such as lexicon, morphology, or syntax. While the idea 
of learning languages utilizing language material (as opposed to learn-
ing by prescribed rules) has been around for several decades, and its 
efficacy has been experimentally substantiated, the use of parallel cor-
pora for that purpose has received significantly less attention (Lawson, 
2001; Bluemel, 2014; Montero Perez et al. 2014, to name a few).
Learners benefit from corpus tools that are easy to use and visually 
help them to explore the respective content. Corpus search activities 
are either learner-driven, in the case of autonomous learners or open 
exercises, or instructor-driven, when learners are given concrete tasks 
to perform. While a learner already needs to have acquired a certain 
level of autonomy for the former case, the latter requires some form 
of feedback from the teacher in case the learners have not understood 
the motivation behind those tasks. That is why we are going one step 
further and use sentence pairs retrieved from corpora for generating 
language learning exercises (Section 5). Having annotated and aligned 
parallel sentences facilitates a whole new range of exercise types.
The term crowdsourcing is often associated with the idea of a large 
number of people doing voluntary work. Voluntariness, however, needs 
to be seen with respect to the motivation of the volunteers. Whether 
they are contributing out of interest, are getting paid for their work, or 
need to participate for other reasons (e.g. to pass a course) makes a 
difference concerning the results we expect to get. In addition to mo-
tivation, we can distinguish, whether crowdsourcers are consciously 
contributing or not, and thus providing explicit or implicit feedback 
(Wang et  al., 2019). As opposed to amateur scientists participating 
in research projects, which is typically referred to as “citizen science”, 
crowdsourcers can be lay people with no expert knowledge (Section 6).
Having briefly discussed all the relevant topics, we proceed to describe 
the envisaged architecture for the application in Section 7 addressing the 
previously described challenges. The corpus retrieval functionality has 
104
Slovenščina 2.0, 2022 (2) | Articles
been implemented and fed with parallel sentences from the OpenSub-
titles corpus (Lison and Tiedemann, 2016) in 21 language pairs, namely 
every combination of the Catalan, English, French, German, Italian, Span-
ish and Swedish part of that corpus. We named it PaCLE (Parallel Corpora 
for Language Learning Exercises) and used it in several experiments, one 
of which we describe in Graën et al. (in press).
2 Parallel corpora
In a previous work (Zanetti, Volodina, and Graën 2021), we describe 
two challenges of automated exercise generation, namely reducing the 
ambiguity of generated exercises with the help of NLP methods, and 
the selection of appropriate sentences from corpora. In both cases, 
parallel corpora will be of great avail.
Parallel corpora consist of at least two datasets that refer to the 
same sequence of language material. The typical cases are bilingual or 
multilingual corpora, where those datasets correspond to translations 
of some material. The original material can be one of the datasets but 
does not necessarily need to be part of the corpus. As for the material, 
most parallel corpora consist of plain text, but parallel corpora of au-
dio recordings also exist, which are often accompanied by transcripts, 
such as the Parallel Audiobook Corpus1 (Ribeiro 2018). What is more, 
corpora consisting of several layers in the same language, such as the 
just-mentioned Parallel Audiobook Corpus which comprises record-
ings of different speakers reading the same books, also meet the con-
dition of parallelism. Finally, learner corpora that comprise not only the 
learners’ writings but also a normalized or corrected version of their 
text productions are also covered by the term parallel corpus.
Unlike parallel corpora, so-called comparable corpora do not nec-
essarily possess parallel structures, but merely share the same top-
ics per corresponding unit (e.g.,  articles). Wikipedia2 can be seen as 
a comparable corpus, since a correspondence relation between lan-
guages can be established for individual articles (McEnery and Xiao, 
2007; Otero and López, 2010; Barrón-Cedeno et al., 2015).
1 https://datashare.is.ed.ac.uk/handle/10283/3217
2 https://www.wikipedia.org/
105
Learning languages from parallel corpora: a blueprint for turning corpus examples...
2.1 Sources
Many parallel corpora have been made freely available over the last 
few decades. The largest source of parallel corpus material is arguably 
the OPUS collection3 (Tiedemann, 2009, 2012). We have recompiled 
a small number of existing parallel corpora of different text types and 
languages (including low-resource languages such as Romansh and 
Swiss German) into a common format that allows for hierarchical cor-
respondence annotation (Graën 2018) on any of the three levels that 
each of the individual corpora has, namely documents, sentences and 
words (i.e. tokens) (Graën et al., 2019).
At first, parallel corpora were compiled from publicly available 
translations. In several countries with more than one official language, 
documents from the respective authorities need to be translated from 
their original language to all other official ones. Typical examples of 
such corpora are the Canadian Hansards (Gale and Church, 1991, 
1993), parliamentary debates in English and French, or the Belgisch 
Staatsblad (Vanallemeersch 2010), publications from the Belgian gov-
ernment in Dutch and French. In countries like Switzerland with three 
official languages (on the federal level) and multinational organizations 
such as the United Nations or the European Union, multilingual transla-
tions are produced that can and have been turned into corpora (Koehn, 
2005; Rafalovitch et al., 2009; Eisele and Chen, 2010; Volk et al., 2010, 
2016; Scherrer et al., 2014; Ziemski et al., 2016).
2.2 Alignment
The individual correspondence of textual units (e.g.  sentences or 
words) is called an alignment, as is the process of deriving these corre-
spondence relations. While the correspondence on the document level 
is typically derived by metadata (e.g. book chapters, webpages, exter-
nal identifiers such as numbers assigned to documents), the identifica-
tion of corresponding sentences and words requires dedicated tools. 
The performance of sentence alignment depends to a large part on 
how many one-to-one correspondences there are – that is, one sen-
tence in one language translated to exactly one sentence in the other 
3 https://opus.nlpl.eu/
106
Slovenščina 2.0, 2022 (2) | Articles
language. If there are numerous one-to-many relations or sentences 
without correspondence in the other language, so-called null align-
ments, the alignment performance can be significantly lower. A num-
ber of commonly used tools and methods exist to improve alignment 
performance (e.g. Varga et al., 2005; Braune and Fraser, 2010; Senn-
rich and Volk, 2010), and new methods keep being developed (Thomp-
son and Koehn, 2019; Jiang et al., 2020).
For word alignment, the respective language pairs play an impor-
tant role. As a rule of thumb, languages with similar structures and 
word formation yield better results. If bilingual alignments of more 
than two languages are combined, two scenarios are possible. Either 
all alignments agree, which suggests good quality of the individual 
bilingual alignments, or there are discrepancies between the pairwise 
alignments, which indicates that one or more of the alignments are 
erroneous, as not all identified correspondences can be correct in this 
case (cf. Graën et  al., 2019). An approach of rotating triangulation 
can be used in this case to combine several bilingual alignments into 
a single harmonized multilingual one, and thus improve alignment 
quality.
In the same vein, the combination of different alignment tech-
niques helps improve alignment quality. Ensemble methods such as 
the one presented by Steingrı́msson, Loftsson, and Way (2021) have 
an advantage over the individual alignment methods, as seen in per-
formance metrics such as the score or the alignment error rate (see 
Tiedemann, 2011, Section 2.6). Modern sentence aligners achieve 
better results by employing pre-trained multilingual neural language 
models (see Jalili Sabet et al., 2020; Dou and Neubig, 2021).
Alignment information in a corpus can be aggregated to derive a 
distribution from a single lexical unit in the source language to differ-
ent units in the target language. The translation variants determined 
and quantified in this way help us select the right context, including 
word sense (see Section 3). We used these distributions to calculate a 
semantic relation between word pairs by means of translation variants 
(Graën and Schneider, 2020). Figure 1 shows a visualization from the 
tool that we created for learners to explore the semantics of translation 
variants from corpora.
107
Learning languages from parallel corpora: a blueprint for turning corpus examples...
Figure 1: Shared and unique translation variants for English ‘stay’ and Spanish ‘quedarse’ 
in various languages. Word frequencies are expressed by the size of nodes and alignment 
probabilities by the thickness of edges. Individual languages are color-coded.
3 Learner proficiency
Like any other skill, learning a language starts with the first contact with 
the target, and eventually ends with its mastery. In between, there is 
a continuum that can be subdivided into a scale of proficiency levels 
defined by capabilities that a learner is required to achieve. Several 
standards of scaling exist and can be approximately mapped to each 
other, as they all define waypoints on the journey of acquiring a foreign 
language.
The proficiency of an individual learner can be measured in several 
dimensions, the two most prominent ones being reception vs. produc-
tion and oral vs. written. The Common European Framework of Refer-
ence for Languages (CEFR) (Council of Europe 2001) subdivides “lan-
guage activities” into reception and production as primary activities 
108
Slovenščina 2.0, 2022 (2) | Articles
and interaction and mediation as secondary ones (Council of Europe 
2001, Section 2.1.3).
Figure 2 replicates Figure 1 from the Common European Frame-
work of Reference for Languages, which divides the proficiency scale 
into three coarse-grained levels (basic, independent and proficient 
user), each of which is further subdivided into two levels. We will 
henceforth refer to the six levels from A1 to C2 as CEFR levels. The 
CEFR scale has become a ubiquitous measure of language learning 
proficiency, and courses now indicate which level can be obtained after 
successfully finishing them, while job offers use them to specify profi-
ciency requirements.
Figure 2: The “Common Reference Levels” as defined by (Council of Europe, 2001).
In the field of CALL, a multitude of research has been using the 
CEFR levels for various purposes, e.g.  for the classification of texts 
(see Pilán et al., 2017) or the prediction of learner proficiency (Gail-
lat et al., 2022). The CEFRLex project4 (François et al., 2016) provides 
mappings from lexical entries to distributions of CEFR levels for several 
languages. Those distributions stem in most cases from an analysis of 
textbooks. Each textbook is dedicated to a particular proficiency level, 
and the appearance of lexical entries (words and expressions) in the 
respective textbooks is represented as a frequency distribution. This 
distribution undergoes a normalization step to account for peaks of 
low-frequency entries, which is typically due to particular topics involv-
ing those entries (Dürlich and François, 2018).
We compared the English EFLLex from the CEFRLex resourc-
es (Dürlich and François, 2018) with two other lexical resources for 
4 https://cental.uclouvain.be/cefrlex/
109
Learning languages from parallel corpora: a blueprint for turning corpus examples...
English, namely the Pearson Global Scale of English (Pearson, 2017) 
and the Cambridge English Vocabulary Profile (Cambridge University 
Press, 2015), and found that they all agree to a large extent regarding 
the assigned CEFR level per lexical entry (Graën et al., 2020). The main 
difference between EFLLex and the other two resources is that the lat-
ter distinguish word senses, from which we had to abstract away for 
the sake of comparability by choosing the lowest level per entry, which 
typically corresponds to the most frequently used sense.
The word “stay” with the sense “to live in a place for a short time 
as a visitor or guest”, for example, is classified by the Global Scale of 
English as beginner level (A1) on the CEFR scale. The same word is also 
used with the sense “to continue to be in a particular state, and not 
change”, which is classified as an intermediate level (B1). Multiword 
expressions such as the phrasal verbs “stay on” or “stay out of” rank 
even higher (B2).
Apart from lexical resources, the frequency of a lexical unit in a 
general corpus and its length in terms of characters are also good in-
dicators for the corresponding proficiency level. The relation between 
these two properties is illustrated by Zipf’s law of abbreviation: shorter 
words are more frequently used and frequently used words tend to be 
shorter in general.
In addition to comparing EFLLex with other English resources, 
we also proved the hypothesis that “similar words in two languages, 
i.e. good direct translations, should have similar CEFR levels” (Graën 
et al., 2020, Section 3.5) by combining three monolingual CEFRLex re-
sources, namely EFLLex for English, FLELex for French (François et al., 
2014) and SVALex for Swedish (François et al., 2016), into one multilin-
gual resource with the help of alignment probabilities obtained from a 
large parallel corpus (Graën, 2018), which we then used together with 
the raw CEFR level provided by EFLLex to predict the CEFR level of lexi-
cal entries from the above-mentioned lexical resources, the Pearson 
Global Scale of English and the Cambridge English Vocabulary Profile.
With the knowledge of how to identify words in different languag-
es whose CEFR levels are strongly correlated, we can use one of the 
 CEFRLex resources to project CEFR levels from one language onto an-
other for which no equivalent resource exists. For multilingual corpora, 
110
Slovenščina 2.0, 2022 (2) | Articles
as a matter of course we can project jointly from several languages for 
which CEFR-graded lexical resources are available.
4 Data-driven learning
A typical way for a learner to start learning an unfamiliar language is 
through language classes with the help of textbooks. Once an exer-
cise in the textbook has been solved, however, it cannot be reused in 
a meaningful way, as doing exactly the same exercise more than once 
is a tedious task. To keep learners motivated, teachers need not only 
to have access to a large repertoire of different learning activities, in-
cluding exercises, but also need a constant supply of novel language 
material.
A quarter of a century ago, Wilson (1997) identified “two major 
problems” in creating a language course. Both have to do with the avail-
ability of sufficient language learning material. The first one is about 
meeting “the needs of students with different abilities”, while the sec-
ond one addresses the need to provide “enough exercises to ensure 
that a student is confronted by a different set of examples whenever he 
or she uses the language learning program”. In Wilson’s view, “corpora 
present a unique and unexploited resource” in this context.
Boulton and Cobb (2017) performed a meta-analysis of publica-
tions studying the effects of data-driven learning, and concluded that 
this technique is both efficient and effective. In a previous study on 
the same topic (Cobb and Boulton, 2015), the authors state that for 
data-driven learning to succeed, “massive but controlled exposure to 
authentic input is of major importance, as learners gradually respond 
to and reproduce the underlying lexical, grammatical, pragmatic, and 
other patterns implicit in the languages they encounter”.
5 Language learning exercises
Language learning exercises aim at improving the language skills of 
learners, which, at first glance, seems to be an obvious truism, though 
not all exercises are equally effective in all contexts. Under some con-
ditions, the learning effect can be small to nonexistent, if, for example, 
the learner is overchallenged by an exercise and cannot solve it. Laufer 
111
Learning languages from parallel corpora: a blueprint for turning corpus examples...
and Ravenhorst-Kalovski (2010) evaluate the vocabulary size required 
for an “adequate reading comprehension” of regular texts in a foreign 
language, but also underline that the text type plays a role in this, and 
that texts with “a large proportion of technical and jargon vocabulary” 
might be more challenging to comprehend. On the other extreme, un-
derchallenging the learner can also lead to them quickly losing motiva-
tion (Mousavian Rad et al., 2022).
For learning to be effective, exercises should thus be neither too 
simple nor too difficult for the learner in question. Language learners 
differ in various dimensions, e.g.  in age (from elementary school pu-
pils to language students at university level, or adult learners), motiva-
tion (intrinsic or extrinsic), current proficiency level in the target lan-
guage (beginner to advanced), previous language learning experiences 
(e.g. of similar L2s), their metalinguistic knowledge, etc. Furthermore, 
the settings in which exercises are done also vary: in-class exercises 
vs.  exercises done at home, individual or group exercises, low-stake 
(ungraded) vs. high-stake (graded) activities, and so on.
In the best case, teachers take into account all these properties 
when devising exercises as part of the curriculum, which, optimally, 
consists of complementary exercises and planned repetitions (cf. Na-
tion and Webb, 2011; Nakata and Webb, 2016).
5.1 Limitations for automatically generated exercises
When it comes to generating language learning exercises automatical-
ly, that is by an algorithm instead of a human, only a small number of all 
possible exercise types are eligible, and even fewer can be reliably as-
sessed programmatically. First of all, we want to limit ourselves to the 
interaction of a single learner with the (interactive) exercise. Observing 
a group of learners when they are interacting, e.g. in a role-play exer-
cise, and providing feedback to the individual participants is something 
that language teachers are used to; this is, however, far beyond what 
can be automated today, despite the continuous advance of language 
technology. If human-human interaction is our target, communication 
is best channeled through the computer and the exercise is defined in 
a way such that communication is mostly controlled by the software. 
112
Slovenščina 2.0, 2022 (2) | Articles
This kind of language learning has been the subject of several publi-
cations in the field of computer-mediated communication (CMC). Ac-
cording to Heift and Vyatkina (2017), “CMC has shown to have many 
features similar to face-to-face language classroom interactions such 
as clarification requests and feedback”.
Another limitation to note is that we will exclusively work with writ-
ten text. Oral exercises require additional technologies, speech rec-
ognition for productive exercises and speech generation for receptive 
ones, which add to the likelihood of the software making a mistake 
when generating the exercise or assessing the user input. There are, 
however, existing tools for supporting the oral part of language learn-
ing, e.g. in the area of computer-assisted pronunciation training (CAPT) 
(Fouz-González, 2015; Schwab and Goldman, 2018).
Our third and last limitation concerns the user input. Natural lan-
guage processing techniques are – in their current state – not capable 
of semantically interpreting free-form answers reliably, especially if the 
input provided, which is the users’ textual output, deviates significantly 
from the training material, which for a large share of the available lan-
guages still are newspaper texts and other official documents. Texts 
produced by language learners comprising potentially innovative lexi-
cal and grammatical components typically yield a significantly higher 
error rate when being processed by such models. Assuming that we 
could process texts produced by learners without making annotation 
errors, we would still struggle to provide learners with the helpful feed-
back that a human teacher could. Existing tools that accept free-form 
textual input provide selective feedback on spelling and grammatical 
constructions. A machine-generated exercise where the learner con-
tinues a story for which only the beginning is given – with automated 
feedback provided by an algorithm on writing style, text structure, and 
word choice– is unlikely to be available soon.
5.2 Exercises from parallel corpora
As we have annotated corpus material, we can support the compre-
hension of text by simple means, such as color-coding different parts 
of speech, showing additional information when the user hovers over a 
113
Learning languages from parallel corpora: a blueprint for turning corpus examples...
particular token, interactively displaying syntactic relations (e.g. mark-
ing subject and object relations of verbs or pointing out the respec-
tive base verbs for separated particles in languages such as German or 
Swedish). In parallel corpora, we can also highlight translation equiva-
lents with the help of alignments (as we do in multilingwis, see Clema-
tide et al., 2016; Graën et al., 2017) or combine alignments and syntax 
to retrieve meaningful chunks of words (as in Zanetti et al., 2021).
In an earlier work (Alfter and Graën, 2019) we present the proto-
type of a game to train particle verbs in English and Swedish. A virtual 
currency is used for motivational purposes. The user earns credits for 
correctly guessed particles and loses them if they are wrong, while dif-
ferent types of hints can be “bought” by using credits. Parallel data 
used by the application is extracted from the CoStEP corpus (Graën 
et al., 2014), which is based on Europarl (Koehn, 2005), and annotated 
in an unsupervised way. Particle verbs are classified with respect to 
their proficiency level based on EFLLex (Dürlich and François, 2018) 
and SVALex (François et al., 2016).
Our work described in Zanetti, Volodina, and Graën (2021) intro-
duces a novel type of sentence reordering exercise. We address the 
issue of potentially erroneous alignment of function words and the 
(sometimes) unclear correspondence of functional parts by merging 
single tokens to chunks based on their syntactic relations. We extract-
ed sentences from the OpenSubtitles corpus (Lison and Tiedemann, 
2016), processed them with standard natural language processing 
pipelines, and used language-specific readability measures to esti-
mate the complexity of sentences.5
6 Crowdsourcing
A crowdsourcing application known by many people is “recaptcha” 
(Von Ahn et al., 2008), a word recognition task that users have to solve 
before they are allowed to proceed to the actual web content they re-
quested. These puzzles have a dual purpose: by solving them, the users 
primarily prove that they are human, but at the same time they provide 
5 A prototype of the envisaged exercise type can be tested here: https://codepen.io/gi0/pen/
vYLJYjp.
114
Slovenščina 2.0, 2022 (2) | Articles
human judgments on words that are unknown to the recaptcha system, 
thus contributing to a dataset that can be used to train OCR algorithms.
Apart from this prototypical example, where crowdsourcing is used 
“along the way”, there are tools for creating crowdsourcing experi-
ments and having people solve a large number of tasks.6 Users of those 
applications typically spend a considerable amount of time performing 
a large number of tasks. Here, the recruitment of crowdsourcers plays 
a key role. One can disseminate information and ask people to volun-
teer, or require university students to contribute a particular number 
of tasks, as is frequently done for publications about crowdsourcing 
experiments.
The crowdsourcing taxonomy by Geiger et al. (2011) can be em-
ployed to classify existing crowdsourcing approaches into four different 
categories, based on: 1) who are the contributors, or rather which type 
of contributors are wanted for the application in question, and if they 
have to show their capacity for the given task first; 2) to which degree a 
user can access the contributions of other users; 3) how the contribu-
tions of different users are aggregated or selected; and 4) whether or 
under which circumstances contributions are remunerated. For cases 
where no remuneration is available, the authors list as potential mo-
tivational factors “passion, fun, community identification, or personal 
achievement”.
Another dimension is defined by the degree to which the partici-
pants are conscious as to whether they are contributing their efforts 
towards a particular goal. Most cases can be unequivocally assigned to 
one extreme or the other. Any paid crowdsourcing work is by definition 
explicit, unless the participants are paid for a different task than the 
one whose data is actually being crowdsourced. At the other extreme, 
analyzing log files to see how users interact with some software is a 
good example of implicit crowdsourcing (Wang et  al., 2019). In be-
tween we have situations with no explicit tasks and where users might 
or might not know that they are contributing data through their interac-
tions with software.
6 E.g.  the open PyBossa (https://pybossa.com/) or Amazon Mechanical Turk (https://www.
mturk.com/) for paid microservices.
115
Learning languages from parallel corpora: a blueprint for turning corpus examples...
7 The application
Figure 3: The PaCLE application showing five examples for a parallel corpus search in 
the English-Swedish part of OpenSubtitles. Matching parts are highlighted. The use of ad-
vanced regular expressions is supported.
The blueprint for the application that we describe in this work can 
be split into two phases: First, an offline phase, in which sentence pairs 
are extracted from parallel corpora, processed with (language-specific) 
116
Slovenščina 2.0, 2022 (2) | Articles
NLP techniques, assessed regarding their usefulness in language learn-
ing and, finally, added to a database. Second, an online phase, in which 
a web application interacts with two types of users, namely teachers 
and learners.7 The application allows users to perform searches in the 
corpus examples using metadata (e.g.  the source of the respective 
example) and derived measures (e.g. the estimated target proficiency 
levels) as filters. The retrieved sentence pairs can then be manually re-
viewed and turned into learning exercises. In Graën et al. (in press), we 
used an early prototype of the application in a language-learning class 
and analyzed the students’ use of the tool and other technologies. Fig-
ure 3 shows the user interface.8
One criterion for filtering out sentences in the offline phase is that 
they are not immediately comprehensible to the reader without the 
contexts in which they appear in the corpus. Pilán et al. (2017) provide 
an extensive overview of measures that can be employed for selecting 
corpus examples suitable for use in educational contexts. Some of the 
measures they list do not require sentences to be excluded a priori, but 
rather determine for which type and proficiency of learners they can be 
used (e.g. measures concerning grammatical or lexical complexity). In 
addition to monolingual criteria that are applied to one part of a parallel 
corpus,9 we define measures on sentence pairs that determine wheth-
er those pairs are added to the database and measures that are used 
in the online phase for making a selection that fits the requirements of 
a particular configuration (languages, search terms, learner proficiency 
level, exercise type, etc.).
A measure that can be used in both phases is the degree of 
equivalence between the two sentences in terms of syntactic struc-
tures and lexical items that are used as translations of each other. By 
7 We do not envisage providing two different applications or user modes for teachers and 
learners, as we conceive autonomous language learners as their own teachers and, beyond 
that, have no means to distinguish them technically.
8 We started developing the web application with desktop clients in mind. We discourage us-
ing the application on mobile phones as, from our perspective, the attention span on those 
devices is often lower, less information can be displayed (although today’s mobile phones 
typically have a high resolution), and user input is not as precise and fluent as with regular 
keyboards and pointing devices.
9 We do not distinguish between source and target languages at that stage. Later on, when 
selecting corpus examples in the online phase, we usually prefer the target language to be 
the one that is more comprehensible.
117
Learning languages from parallel corpora: a blueprint for turning corpus examples...
calculating structural equivalence in terms of the relative frequency 
that the structure in question is used in a parallel corpus in relation 
to the overall number of structures identified in both sentences, we 
obtain a ratio (values between 0 and 1) for which we define a thresh-
old for inclusion in the database. For lexical items, a similar formula is 
used. Higher values of both measures mean that we expect the sen-
tence pair in question to show more frequently used structural and 
lexical correspondence and, consequently, represent a more direct 
translation (as opposed to a freer one with less frequent correspond-
ences and, hence, lower values).
7.1 Corpora
While a variety of parallel corpora can be obtained easily, e.g. down-
loaded directly from the OPUS collection (Tiedemann, 2009, 2012), 
not all of them are equally suited for language learning purposes. For a 
corpus to fit the needs of learners, in the optimal case, it should com-
prise language material that a) is adequate for the proficiency level 
of said learners, b) comprises the material to be learned (lexical ele-
ments, grammatical constructions, and so on), c) be sufficiently large 
so that the application can choose from a large number of examples, 
and d) be of interest to the learner. The latter point is unequivocally 
learner-dependent, but we expect that there are domains that are gen-
erally better received than others (e.g. law texts vs. fiction).
One source of parallel texts that we found particularly useful for 
the purpose of language learning is the OpenSubtitles corpus (Lison 
and Tiedemann, 2016) which we used in Zanetti, Volodina, and Graën 
(2021), but also for the PaCLE application. It consists of translated 
subtitles for a large number of movies. Translations are contributed by 
users who can also review the work of other users. A large number of 
subtitles is available for most of the available 62 languages, but for 
some languages – such as Bengali, Georgian, or Tagalog – the coverage 
is quite low, and insufficient for our purposes.
Besides the large size and coverage of many language pairs with 
this corpus, subtitles have the advantage that “[they] cover various 
genres and time periods and combine features from spoken language 
118
Slovenščina 2.0, 2022 (2) | Articles
corpora and narrative texts including many dialogs, idiomatic expres-
sions, dialectal expressions and slang” (Tiedemann, 2012).
Similar to OpenSubtitles, we find a richer vocabulary and less for-
mal language in corpora of transcribed speech, such as the parliamen-
tary proceedings of the European Union (Koehn 2005), the Canadian 
Hansards (described in Gale and Church, 1991, 1993) or the TED Talks 
corpus (Reimers and Gurevych, 2020).
Corpora compiled from legislative texts, patents, technical manu-
als, medication leaflets, and other more restricted text types might be 
helpful for particular learning tasks and more advanced learners, but 
they are hardly suited for most learners with lower proficiency levels. 
We can also expect to find considerably fewer appearances of offensive 
language, often abbreviated as PARSNIP, than in monolingual corpora 
(Dekker et al., 2019) for the same reason.
7.2 Data preparation
Modern NLP applications use language models that can perform sev-
eral annotation tasks simultaneously. Performance measures show 
that those joint models outperform traditional pipeline approaches 
(Qi et al., 2020). The standard tasks for such models to perform are 
tokenization, lemmatization, part-of-speech tagging, and syntactic de-
pendency parsing. Other tasks include morphological analysis, named-
entity recognition, and word-sense disambiguation, all of which provide 
valuable information for the creation of language learning exercises.
Some corpora are provided pre-aligned (typically on the sentence 
level), but there are corpora indicating alignment only on a higher lev-
el, such as documents or chapters. In such cases we need to perform 
document alignment first, followed by sentence alignment to obtain 
parallel sentences. The correspondence of documents is to a large ex-
tent corpus-specific, and thus no out-of-the-box solutions can be em-
ployed (Graën, 2018, Section 4.1). In the case of multiparallel corpora, 
we might want to apply approaches that produce consistent multilin-
gual alignments (Graën, 2018, Section 4.3).
We also need the retrieved and annotated sentence pairs to 
be word-aligned. By combining the results of different aligners and 
119
Learning languages from parallel corpora: a blueprint for turning corpus examples...
different types of aligners (probabilistic measures vs.  word embed-
dings), we obtain the most reliable alignment links. We then group the 
correspondence links between single tokens using syntactic relations 
as described in Zanetti et  al. (2021). After this, function words such 
as prepositions or particles that often have no correspondence in an-
other language are part of larger units for which we can assert corre-
spondence with higher precision. The groups we build with the help of 
dependency and alignment relations often correspond to phrases, but 
this is not necessarily always the case.
Alignment probabilities calculated on the whole corpus or ob-
tained from another source help us to identify idiomaticity (Schneider 
and Graën, 2018). In support verb constructions, for example, the cor-
respondence of the aligned nouns, which are frequently direct objects 
of the verb in question, is a very strong one; that is, we expect it to 
be the prototypical translation equivalent, while the correspondence 
of the governing verbs is often an infrequent one (but it can also be 
the case that the same support verb is used). The English support verb 
Figure 4: Sentence pair in German and English with different syntactic structures, which is 
highlighted by the heavily crossing alignment links. Here, language-dependent label sets 
have been used instead of Universal Dependencies.
120
Slovenščina 2.0, 2022 (2) | Articles
construction “(to) take a walk”, for example, and the Spanish one “dar 
un paseo” (“give a walk”) are common translations of each other. The 
nouns “walk” and “paseo” also show a high alignment probability in 
any parallel English-Spanish corpus. However, “take” is only a good 
translation of “dar” as part of a limited number of other expressions 
other than “(to) take a walk” / “dar un paseo” (e.g. ”take a step” and 
“dar un paso”).
7.3 Example selection
For the selection of adequate sentence pairs, we envisage using clas-
sifiers like the ones described in Pilán et al. (2017), Pilán (2018) and 
Tack (2021) for the individual sentences. In addition to the estimat-
ed proficiency levels, we will compare the aligned groups of tokens. 
Noun phrases that translate to noun phrases are arguably less chal-
lenging than completely diverging structures. By aggregating syntac-
tic structures and calculating conditional probabilities from the ob-
served frequencies in a large parallel corpus, we can say how likely it 
is for a particular syntactic structure in one language to be translated 
to another structure in the other language. The main idea here is that 
structural correspondences with higher probabilities will be more ad-
vantageous for language learning. Nonetheless, non-standard or less 
frequent correspondences will certainly be of interest for more ad-
vanced learners (Figure 4 shows an example).
7.4 Exercise generation
The combination of two sentences including word alignment paves the 
way for a whole new range of exercise types. At the same time, we can 
use the information of word and phrase correspondence to improve 
common monolingual exercises. For cloze tests, for instance, we can 
use the translation of the sentence in question to identify distractors 
that are unlikely to accidentally fit in the gap.
Contrastive exercises look for similarities and differences between 
the source and target language, and thus foster metalinguistic aware-
ness. Properties that could be the focus of such exercises are mor-
phological features (e.g. grammatical genders), the order of syntactic 
121
Learning languages from parallel corpora: a blueprint for turning corpus examples...
elements (e.g. the position of modifying adjectives relative to their gov-
ernor), or the use of discourse markers.
In the parallel reordering exercise presented in Zanetti et al. (2021) 
and in the gap-filling exercise with parallel clues presented in Alfter 
and Graën (2019), the source language serves as an anchor for the 
learner. Truly multilingual exercises are those where there is no distinc-
tion between source and target languages. One example is a gap-filling 
or cloze exercise in the style of bundled gaps (Wojatzki et al., 2016) but 
with word pairs (or triples, …) in two (or three, …) different languages. A 
potential way to find good distractors is to generate different inflections 
of the original words that have been replaced by the gaps. Alternatively, 
homographs or false friends can be used with non-parallel sentences 
to focus on differences and similarities.
7.5 Crowdsourcing aspects
The way the application is intended to be used is threefold. First, we 
envisage an autonomous learner – i.e. a more advanced learner with 
a good command of technology – to use the application for looking up 
words, expressions, or grammatical constructions in context togeth-
er with their translations. In this scenario, we use the annotation and 
alignment layers obtained during corpus preparation to let the user 
interactively explore the examples that they found. Learners can add 
particular examples to (named) collections, mark their favorites and 
report entire sentence pairs, individual annotations, or alignment links 
that they consider false or dubious.
In the second scenario, teachers look up examples relevant to their 
respective topics, with respect to both content and language. They 
group examples in collections from which they can feed the in-class 
exercises that they prepare. Sharing those collections between teach-
ers and collaborating on the creation of language learning material is 
facilitated by the application (e.g. by just copying an URL and sending it 
to other teachers or students).
The third scenario goes one step further. Here, teachers use collec-
tions of corpus examples to generate exercises. Generated exercises 
can be reviewed and discarded as needed, but the parallelism in the 
122
Slovenščina 2.0, 2022 (2) | Articles
exercise types should generally result in higher precision, so good ac-
curacy can be expected. Teachers then share those exercises with their 
students who, in turn, can also provide feedback in terms of reporting 
any errors or discrepancies in the example items.
In all scenarios, users should be able to fix errors for themselves, 
such as by correcting spelling mistakes in the original corpus material, 
or propose changes that can be reviewed by other users. The simplest 
solution that does not require a dedicated user or group to review all 
proposals is to explicitly ask other users and let them up- or downvote 
the (proposed) changes. In cases with a clear tendency of mostly up-
votes, the solution would be automatically accepted and replace the 
original example. The current prototype allows users to edit the actual 
examples, accept or reject them, and put them on a list of favorites, 
which is meant to keep those examples that learners consider valuable 
to them.
The type of crowdsourcing envisaged for the different scenarios is 
both explicit and implicit. Explicit crowdsourcing involves error correc-
tion and the categorization of annotations as dubious. When users are 
explicitly asked by the application for their opinions on changes pro-
posed by other users, they are also explicitly contributing their knowl-
edge. The collaborative elaboration of language learning material falls 
in the category of crowd annotation.
When users mark their favorite examples or remove elements 
from their collections, they contribute in an implicit way. We can 
only guess why examples have been removed; it might be due to er-
rors in the examples themselves, their annotation, because they are 
not comprehensible for the individual learner, or they simply do not 
match the topic in question. In cases of doubt, we can always turn 
those choices into explicit questions with which we ask other users 
for clarification.
It is important to note that all crowdsourcing tasks are designed 
to stem from intrinsic motivation. The added value of using the ap-
plication for self-learning – which is the corpus search function or the 
assistance provided with the creation of learning exercises – needs 
to convince learners and teachers to voluntarily contribute to the 
project.
123
Learning languages from parallel corpora: a blueprint for turning corpus examples...
8 Conclusions
We have discussed a blueprint for an application that generates lan-
guage learning exercises from parallel corpora. To this end, we have 
outlined the required methods and techniques, and described how it is 
envisaged they will work together in the final application.
Moreover, we have argued how the ensemble of annotation and 
alignment of parallel corpora can be employed to reduce the uncertain-
ty about potential errors in automatically generated exercises. What is 
more, the use of parallel material paves the way for a multitude of novel 
exercise types that encourage learners to contrast target and source 
languages, and thus strengthen their metalinguistic capabilities.
In short, with the help of implicit and explicit crowdsourcing, we 
expect language learning material to gradually improve over time.
Acknowledgments
This research is partly supported by the Swiss National Science Foundation 
under grant P2ZHP1 184212 through the project “From parallel corpora to 
multilingual exercises: Making use of large text collections and crowdsourcing 
techniques for innovative autonomous language learning applications”, con-
ducted at Pompeu Fabra University in Barcelona (with Grael, Grup de Recerca 
en Aprenentatge i Ensenyament de Llengües, and at the University of Gothen-
burg (with Språkbanken Text).
References
Alfter, D., & Graën, J. (2019). Interconnecting Lexical Resources and Word 
Alignment: How Do Learners Get on with Particle Verbs? In Proceedings of 
the 22nd Nordic Conference of Computational Linguistics (NODALIDA) (pp. 
321–26). Turku, Finland: Linköping University Electronic Press. Retrieved 
from https://www.aclweb.org/anthology/W19-6135
Barrón-Cedeno, A., España Bonet, C., Boldoba Trapote, J., & Márquez Villodre, 
L. (2015). A Factory of Comparable Corpora from Wikipedia. In Proceed-
ings of the Eighth Workshop on Building and Using Comparable Corpora 
(pp. 3–13). Association for Computational Linguistics.
Bluemel, B. (2014). Learning in Parallel: Using Parallel Corpora to Enhance 
Written Language Acquisition at the Beginning Level. Dimension, 31, 48.
124
Slovenščina 2.0, 2022 (2) | Articles
Boulton, A., & Cobb, T. (2017). Corpus Use in Language Learning: A Meta-Anal-
ysis. Language Learning, 67(2), 348–393.
Braune, F., & Fraser, A. (2010). Improved Unsupervised Sentence Alignment 
for Symmetrical and Asymmetrical Parallel Corpora. In Proceedings of the 
23rd International Conference on Computational Linguistics (COLING): 
Posters (pp. 81–89). Association for Computational Linguistics (ACL).
Cambridge University Press. 2015. English Vocabulary Profile. Retrieved from 
https://www.englishprofile.org/wordlists
Clematide, S., Graën, J., & Volk, M. (2016). Multilingwis – a Multilingual Search 
Tool for Multi-Word Units in Multiparallel Corpora. In G. Corpas Pas-
tor (Ed.), Computerised and Corpus-Based Approaches to Phraseology: 
Monolingual and Multilingual Perspectives – Fraseologia Computacional 
y Basada En Corpus: Perspectivas Monolingües y Multilingües (pp. 447–
455). Geneva: Tradulex. doi: 10.5167/uzh-120153
Cobb, T., & Boulton, A. (2015). Classroom Applications of Corpus Analysis. In 
D. Biber & R. Reppen (Eds.), The Cambridge Handbook of English Corpus 
Linguistics (pp. 478–497). Cambridge University Press. doi: 10.1017/
CBO9781139764377.027
Council of Europe. (2001). Common European Framework of Reference for 
Languages: Learning, Teaching, Assessment. Press Syndicate of the Uni-
versity of Cambridge.
Dekker, P., Zingano Kuhn, T., Šandrih, B., Zviel-Girshin, R., Arhar Holdt, Š., & 
Schoonheim, T. (2019). Corpus Filtering via Crowdsourcing for Develop-
ing a Learner’s Dictionary. In I. Kosem & S. Krek (Eds.), Proceedings of the 
eLexicography in the 21st Century (eLex 2019): Smart Lexicography, 1–3 
October 2019, Sintra, Portugal (pp. 84–85). Brno: Lexical Computing CZ, 
s.r.o.
Dou, Z.-Y., & Neubig, G. (2021). Word Alignment by Fine-Tuning Embeddings 
on Parallel Corpora. In Conference of the European Chapter of the Asso-
ciation for Computational Linguistics (EACL), 19–23 April 2021.
Dürlich, L., & François, T. (2018). EFLLex: A Graded Lexical Resource for Learn-
ers of English as a Foreign Language. In N. Calzolari et al. (Eds.), Proceed-
ings of the 11th International Conference on Language Resources and 
Evaluation, 7–12 May 2018, Miyazaki, Japan. European Language Re-
sources Association (ELRA).
Eisele, A., & Chen, Y. (2010). MultiUN: A Multilingual Corpus from United Nation 
Documents. In N. Calzolari et al. (Eds.), Proceedings of the 7th Internation-
al Conference on Language Resources and Evaluation (LREC), 17–23 May 
125
Learning languages from parallel corpora: a blueprint for turning corpus examples...
2010, Valletta, Malta (pp. 2868–2872). European Language Resources As-
sociation (ELRA). Retrieved from https://aclanthology.org/volumes/L10-1/ 
Fouz-González, J. (2015). Trends and Directions in Computer-Assisted Pro-
nunciation Training. Investigating English Pronunciation, 314–342.
François, T., Fairon, C., & Watrin, P. (2016). CEFRLex: A Graded Lexical Re-
source for French Foreign Learners. Retrieved from http://cental.uclou-
vain.be/cefrlex/
François, T., Gala, N., Watrin, P., & Fairon, C. (2014). FLELex: A Graded Lexical 
Resource for French Foreign Learners. In N. Calzolari et  al. (Eds.), Pro-
ceedings of the 9th International Conference on Language Resources and 
Evaluation (LREC), 26–31 May, Reykjavik, Iceland (pp. 3766–3773). Eu-
ropean Language Resources Association (ELRA). Retrieved from https://
aclanthology.org/L14-1
François, T., Volodina, E., Pilán, I., & Tack, A. (2016). SVALex: A CEFR-Graded 
Lexical Resource for Swedish Foreign and Second Language Learners. In N. 
Calzolari et al. (Eds.), Proceedings of the 10th International Conference on 
Language Resources and Evaluation (LREC), May 2016, Portorož, Slovenia 
(pp. 213–219). Retrieved from https://aclanthology.org/L16-1032.pdf 
Gaillat, T., Simpkin, A., Ballier, N., Stearns, B., Sousa, A., Bouyé, M., & Zarrouk, 
M. (2022). Predicting CEFR Levels in Learners of English: The Use of Mi-
crosystem Criterial Features in a Machine Learning Approach. ReCALL, 
34(2), 130–146.
Gale, W. A., & Church, K. W. (1991). A Program for Aligning Sentences in 
Bilingual Corpora. In D. E. Appelt et  al. (Eds.), Proceedings of the 29th 
Annual Meeting of the Association for Computational Linguistics (ACL), 
18–21 June 1991, Berkeley, California, USA (pp. 177–184). Strouds-
burg, PA, USA. Association for Computational Linguistics (ACL). doi: 
10.3115/981344.981367
Gale, W. A., & Church, K. W. (1993). A Program for Aligning Sentences in Bilin-
gual Corpora. Computational Linguistics, 19(1), 75–102.
Geiger, D., Seedorf, S., Schulze, T., Nickerson, R. C., & Schader, M. (2011). 
Managing the Crowd: Towards a Taxonomy of Crowdsourcing Processes. 
In AMCIS 2011 Proceedings - All Submissions: Virtual Communities and 
Collaborations (p. 430).
Graën, J. (2018). Exploiting Alignment in Multiparallel Corpora for Applications 
in Linguistics and Language Learning. PhD thesis. University of Zurich.
Graën, J., Alfter, D., & Schneider, G. (2020). Using Multilingual Resources to 
Evaluate CEFRLex for Learner Applications. In Proceedings of the 12th 
126
Slovenščina 2.0, 2022 (2) | Articles
Language Resources and Evaluation Conference (LREC), 2020, Marseille, 
France (pp. 346–355). Marseille, France: European Language Resourc-
es Association (ELRA). Retrieved from https://www.aclweb.org/anthol-
ogy/2020.lrec-1.43
Graën, J., Bach, C., & Cassany, D. (in press). Using a Bilingual Concordancer to 
Promote Metalinguistic Reflection in the Learning of an Additional Lan-
guage: The Case of B1 Learners of Catalan. In n/a. Peter Lang.
Graën, J., Batinic, D., & Volk, M. (2014). Cleaning the Europarl Corpus for Lin-
guistic Applications. In J. Ruppenhofer & G. Faaß (Eds.), Proceedings of 
the 12th edition of the Conference on Natural Language Processing (KON-
VENS) (Vol 1, pp. 222–227). Stiftung Universität Hildesheim. GSCL, ÖGAI, 
DGfS, Clarin-D, University of Hildesheim. doi: 10.5167/uzh-99005
Graën, J., Kew, T., Shaitarova, A., & Volk, M. (2019). Modelling Large Parallel 
Corpora: The Zurich Parallel Corpus Collection. In P. Bański et al. (Eds.), 
Challenges in the Management of Large Corpora (CMLC). Leibniz-Institut 
für Deutsche Sprache. doi: 10.14618/ids-pub-9020
Graën, J., Sandoz, D., & Volk, M. (2017). Multilingwis. Explore Your Parallel 
Corpus. In J. Tiedemann & N. Tahmasebi (Eds.), Proceedings of the 21st 
Nordic Conference of Computational Linguistics (NODALIDA), May 2017, 
Gothenburg, Sweden (pp. 247–250). Association for Computational Lin-
guistics (ACL). doi: 10.5167/uzh-137129
Graën, J., & Schneider, G. (2020). Exploiting Multiparallel Corpora as a Meas-
ure for Semantic Relatedness to Support Language Learners. In D. Levey 
(Ed.), Strategies and Analyses of Language and Communication in Mul-
tilingual and International Contexts (pp. 153–167). Cambridge Scholars 
Publishing.
Heift, T., & Vyatkina, N. (2017). Technologies for Teaching and Learning L2 
Grammar. The Handbook of Technology and Second Language Teaching 
and Learning, 26–44.
Jalili Sabet, M., Dufter, P., Yvon, F., & Schütze, H. (2020). SimAlign: High Qual-
ity Word Alignments Without Parallel Training Data Using Static and 
Contextualized Embeddings. In B. Webber, T. Cohn, Y. He & Y. Liu (Eds.), 
Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-
guage Processing: Findings, November 2020, online (pp. 1627–1643). 
Association for Computational Linguistics (ACL). Retrieved from https://
www.aclweb.org/anthology/2020.findings-emnlp.147
Jiang, C., Maddela, M., Lan, W., Zhong, Y., & Xu, W. (2020). Neural CRF Model 
for Sentence Alignment in Text Simplification. In D. Jurafsky, J. Chai, N. 
127
Learning languages from parallel corpora: a blueprint for turning corpus examples...
Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the 
Association for Computational Linguistics, July 2020, online (pp. 7943–
7960). Association for Computational Linguistics (ACL). doi: 10.18653/
v1/2020.acl-main.709
Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Transla-
tion. In Machine Translation Summit, 5, 79–86. Asia-Pacific Association 
for Machine Translation.
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical Threshold Revis-
ited: Lexical Text Coverage, Learners’ Vocabulary Size and Reading 
Comprehension.
Lawson, A. (2001). Collecting, Aligning and Analysing Parallel Corpora. Small 
Corpus Studies and ELT: Theory and Practice. Amsterdam, John Benja-
mins, 279–309.
Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Par-
allel Corpora from Movie and TV Subtitles. In N. Calzolari et  al. (Eds.), 
Proceedings of the 10th International Conference on Language Resources 
and Evaluation (LREC), May 2016, Portorož, Slovenia. European Language 
Resources Association (ELRA). Retrieved from https://aclanthology.org/
L16-1147/ 
Lu, X. (2018). Natural Language Processing and Intelligent Computer-
Assisted Language Learning (ICALL). In The TESOL Encyclopedia of 
English Language Teaching (pp. 1–6). John Wiley & Sons, Ltd. doi: 
10.1002/9781118784235.eelt0422
McEnery, T., & Xiao, Z. (2007). Parallel and Comparable Corpora: The State of 
Play. Corpus-Based Perspectives in Linguistics 6.
Montero Perez, M., Paulussen, H., Macken, L., & Desmet, P. (2014). From In-
put to Output: The Potential of Parallel Corpora for CALL. Language Re-
sources and Evaluation, 48(1), 165–189.
Mousavian Rad, S. E., Roohani, A., & Mirzaei, A. (2022). Developing and Vali-
dating Precursors of Students’ Boredom in EFL Classes: An Exploratory 
Sequential Mixed-Methods Study. Journal of Multilingual and Multicultur-
al Development, 1–18. doi: 10.1080/01434632.2022.2082448 
Nakata, T., & Webb, S. (2016). Vocabulary Learning Exercises: Evaluating a 
Selection of Exercises Commonly Featured in Language Learning Materi-
als. In SLA Research and Materials Development for Language Learning, 
139–154. Routledge.
Nation, I. S. P., & Webb, S. 2011. Researching and Analyzing Vocabulary. Hein-
le, Cengage Learning Boston, MA.
128
Slovenščina 2.0, 2022 (2) | Articles
Otero, P. G., & González López, I. (2010). Wikipedia as Multilingual Source of 
Comparable Corpora. In Proceedings of the 3rd Workshop on Building and 
Using Comparable Corpora, LREC (pp. 21–25). Citeseer.
Pearson. (2017). GSE Teacher Toolkit. Retrieved from https://www.english.
com/gse/teacher-toolkit/user/lo
Pilán, I. (2018). Automatic Proficiency Level Prediction for Intelligent Comput-
er-Assisted Language Learning. PhD thesis. University of Gothenburg.
Pilán, I., Volodina, E., & Borin, L. (2017). Candidate Sentence Selection for 
Language Learning Exercises: From a Comprehensive Framework to an 
Empirical Evaluation. Revue Traitement Automatique Des Langues. Spe-
cial Issue on NLP for Learning and Teaching. 57(3), 67–91.
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Py-
thon Natural Language Processing Toolkit for Many Human Languages. 
In D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of 
the 58th Annual Meeting of the Association for Computational Linguis-
tics: System Demonstrations, July 2020, online (pp. 101–108). Asso-
ciation for Computational Linguistics (ACL). doi: 10.18653/v1/2020.
acl-demos.14
Rafalovitch, A., & Dale, R. (2009). United Nations General Assembly Reso-
lutions: A Six-Language Parallel Corpus. In Proceedings of the Machine 
Translation Summit, 12, 292–299.
Reimers, N., & Gurevych, I. (2020). Making Monolingual Sentence Embed-
dings Multilingual Using Knowledge Distillation. In Q. Liu & D. Schlangen 
(Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP) (pp. 4512–4525). Association for Com-
putational Linguistics (ACL). doi: 10.18653/v1/2020.emnlp-main.365
Ribeiro, M. S. (2018). Parallel Audiobook Corpus (version 1.0), University of 
Edinburgh. School of Informatics. doi: 10.7488/ds/2468
Scherrer, Y., Nerima, L., Russo, L., Ivanova, M., & Wehrli, E. (2014). Swis-
sAdmin: A Multilingual Tagged Parallel Corpus of Press Releases. In N. 
Calzolari et al. (Eds.), Proceedings of the 9th International Conference on 
Language Resources and Evaluation (LREC), 26–31 May, Reykjavik, Ice-
land. European Language Resources Association (ELRA). Retrieved from 
https://aclanthology.org/L14-1
Schneider, G., & Graën, J. (2018). NLP Corpus Observatory – Looking for Con-
stellations in Parallel Corpora to Improve Learners’ Collocational Skills. 
In I. Pilán, E. Volodina, D. Alfter & L. Borin (Eds.), Proceedings of the 7th 
Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 
129
Learning languages from parallel corpora: a blueprint for turning corpus examples...
(NLP4CALL), November 2018, Stockholm, Sweden (pp. 69–78). LiU Elec-
tronic Press. doi: 10.5167/uzh-157985
Schwab, S., & Goldman, J.-P. (2018). MIAPARLE: Online Training for Discrimi-
nation and Production of Stress Contrasts. In K. Klessa et al. (Eds.), Proc. 
9th Int. Conf. Speech Prosody, 13–16 June 2018, Poznań, Poland (pp. 
572–576). doi: 10.21437/SpeechProsody.2018-116 
Sennrich, R., & Volk, M. (2010). MT-Based Sentence Alignment for OCR-Gen-
erated Parallel Texts. In Proceedings of the 9th Conference of the Associa-
tion for Machine Translation in the Americas (AMTA), 31 October – 5 No-
vember 2010, Denver, Colorado, USA. Association for Machine Translation 
in the Americas (AMTA). Retrieved from https://aclanthology.org/2010.
amta-papers.14.pdf 
Steingrı́msson, S., Loftsson, H., & Way, A. (2021). CombAlign: A Tool for Ob-
taining High-Quality Word Alignments. In Proceedings of the 23rd Nordic 
Conference on Computational Linguistics (NoDaLiDa), 31 May – 2 June 
2021, Reykjavik, Iceland, Sweden, online (pp. 64–73). Linköping Uni-
versity Electronic Press, Sweden. Retrieved from https://aclanthology.
org/2021.nodalida-main.7
Tack, A. (2021). Mark My Words! On the Automated Prediction of Lexical Dif-
ficulty for Foreign Language Readers. PhD thesis.
Thompson, B., & Koehn, P. (2019). Vecalign: Improved Sentence Alignment 
in Linear Time and Space. In Proceedings of the 2019 Conference on Em-
pirical Methods in Natural Language Processing and the 9th International 
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), No-
vember 2019, Hong Kong, China (pp. 1342–1348). Association for Com-
putational Linguistics (ACL). Retrieved from https://aclanthology.org/
D19-3.pdf 
Tiedemann, J. (2009). News from OPUS – a Collection of Multilingual Parallel 
Corpora with Tools and Interfaces. In Proceedings of Recent Advances in 
Natural Language Processing (RANLP), 5, 237–248.
Tiedemann, J. (2011). Synthesis Lectures on Human Language Technologies 2. 
Morgan & Claypool. doi: 10.2200/S00367ED1V01Y201106HLT014
Tiedemann, J. (2012). Parallel Data, Tools and Interfaces in OPUS. In N. Cal-
zolari et al. (Eds.), Proceedings of the 8th International Conference on Lan-
guage Resources and Evaluation (LREC), May 2012, Istanbul, Turkey (pp. 
2215–2218). European Language Resources Association (ELRA). Re-
trieved from http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_
Paper.pdf 
130
Slovenščina 2.0, 2022 (2) | Articles
Vanallemeersch, T. (2010). Belgisch Staatsblad Corpus: Retrieving French-
Dutch Sentences from Official Documents. In N. Calzolari et  al. (Eds.), 
Proceedings of the 7th International Conference on Language Resources 
and Evaluation (LREC), May 2010, Valletta, Malta (pp. 3413–3416). Eu-
ropean Language Resources Association (ELRA). Retrieved from http://
www.lrec-conf.org/proceedings/lrec2010/pdf/758_Paper.pdf 
Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V., & Nagy, V. (2005). Paral-
lel Corpora for Medium Density Languages. In G. Angelova, K. Bontcheva, 
R. Mitkov, N. Nicolov, N. Nikolov (Eds.), Proceedings of Recent Advances in 
Natural Language Processing (RANLP), 21–23 September 2005, Borovets, 
Bulgaria (pp. 590–596). Retrieved from http://lml.bas.bg/ranlp2005/ 
Volk, M., Amrhein, C., Aepli, N., Müller, M., & Ströbel, P. (2016). Building a Par-
allel Corpus on the World’s Oldest Banking Magazine. In KONVENS. s.n. 
doi: 10.5167/uzh-125746.
Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M., Furrer, L., & Ruef, B. 
(2010). Challenges in Building a Multilingual Alpine Heritage Corpus. In 
N. Calzolari et al. (Eds.), Proceedings of the 7th International Conference 
on Language Resources and Evaluation (LREC), 17–23 May 2010, Vallet-
ta, Malta. European Language Resources Association (ELRA). Retrieved 
from http://www.lrec-conf.org/proceedings/lrec2010/pdf/110_Paper.pdf 
Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). Re-
captcha: Human-Based Character Recognition via Web Security Meas-
ures. Science, 321(5895), 1465–68.
Wang, C., Daneva, M., Van Sinderen, M., & Liang, P. (2019). A Systematic Map-
ping Study on Crowdsourced Requirements Engineering Using User Feed-
back. Journal of Software: Evolution and Process, 31(10), e2199.
Wilson, E. (1997). The Automatic Generation of CALL Exercises from General 
Corpora. In A. Wichmann, S. Fligelstone, T. McEnery & G. Knowles (Eds.), 
Teaching and Language Corpora (Applied linguistics and language study) 
(pp. 116–30).
Wojatzki, M., Melamud, O., & Zesch, T. (2016). Bundled Gap Filling: A New 
Paradigm for Unambiguous Cloze Exercises. In Proceedings of the 11th 
Workshop on Innovative Use of NLP for Building Educational Applications, 
June 2016, San Diego, CA (pp. 172–81). Association for Computational 
Linguistics (ACL). doi: 10.18653/v1/W16-0519
Zanetti, A., Volodina, E., & Graën, J. (2021). Automatic Generation of Exercises 
for Second Language Learning from Parallel Corpus Data. International 
Journal of TESOL Studies, 3(2), 55–71.
131
Learning languages from parallel corpora: a blueprint for turning corpus examples...
Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Na-
tions Parallel Corpus V1.0. In N. Calzolari et al. (Eds.), Proceedings of the 
10th International Conference on Language Resources and Evaluation 
(LREC), May 2016, Portorož, Slovenia. European Language Resources As-
sociation (ELRA). Retrieved from https://aclanthology.org/L16-1561.pdf 
Učenje jezikov iz vzporednih korpusov: zasnova za spreminjanje 
korpusnih primerov v vaje za učenje jezikov
Članek opisuje arhitekturo aplikacije, ki iz vzporednih korpusov generira vaje 
za učenje jezika. Poravnava besed in vzporedne strukture omogočajo samo-
dejno ocenjevanje stavčnih parov v izvornem in ciljnem jeziku, medtem ko 
uporabniki aplikacije s svojimi interakcijami nenehno izboljšujejo kakovost po-
datkovne zbirke in tako množičijo vzporedno jezikovno učno gradivo. S pomo-
čjo triangulacije se lahko njihovo ocenjevanje prenese tudi na druge jezikovne 
pare, če kot vir uporabimo več vzporednih korpusov.
Da bi lahko takšna aplikacija delovala, je treba nasloviti več izzivov. V na-
daljevanju bomo obravnavali tri. Prvič, v zadnjem desetletju se je nekaj pozor-
nosti posvetilo vprašanju, kako v korpusih prepoznati ustrezno učno gradivo. 
Podrobno bomo opisali, kako na to vpliva struktura vzporednih korpusov. Dru-
gič, katere vrste vaj je mogoče samodejno ustvariti iz vzporednih korpusov, 
tako da spodbujajo učenje in ohranjajo motivacijo učencev. In tretjič, kakšne 
so možnosti vključevanja uporabnikov, tj. učiteljev in učencev, kot množice, ki 
bi pomagala izboljšati gradivo.
Aplikacijo, ki jo opisujemo v članku, smo delno implementirali in preizkusi-
li v različnih eksperimentalnih okoljih. Več funkcij, ki bodo vključene v končno 
programsko opremo, smo razvili in ovrednotili ločeno. Za implementacijo vseh 
delov, ki so podrobno opisani v tem dokumentu, pa je potrebno še veliko dela 
in razpoložljivost dejanskih učiteljev in učencev za namene preskušanja. Da bi 
lahko potrdili želene pozitivne učinke prispevkov uporabnikov, bo treba konč-
ne aplikacije uporabljati dalj časa, kar predstavlja še dodaten izziv.
Ključne besede: ICALL, vaje za učenje jezikov, vzporedni korpusi, učenje na 
podlagi podatkov, množičenje