62
Data preparation in crowdsourcing 
for pedagogical purposes: the case 
of the CrowLL game
Tanara ZINGANO KUHN 
Research Centre for General and Applied Linguistics, University of Coimbra 
Špela ARHAR HOLDT
Faculty of Arts, University of Ljubljana; Faculty of Computer and Information 
Science, University of Ljubljana
Iztok KOSEM
Faculty of Arts, University of Ljubljana; Jožef Stefan Institute
Carole TIBERIUS
Dutch Language Institute
Kristina KOPPEL
Institute of the Estonian Language
Rina ZVIEL-GIRSHIN 
Ruppin Academic Center
One way to stimulate the use of corpora in language education is by making 
pedagogically appropriate corpora, labeled with different types of problems 
(sensitive content, offensive language, structural problems). However, manu-
ally labeling corpora is extremely time-consuming and a better approach 
Zingano Kuhn. T., Arhar Holdt, Š., Kosem, I., Tiberius, C., Koppel, K., Zviel-Girshin, R.: 
Data preparation in crowdsourcing for pedagogical purposes: the case of the 
CrowLL game. Slovenščina 2.0, 10(2): 62–100. 
1.01 Izvirni znanstveni članek / Original Scientific Article
DOI: https://doi.org/10.4312/slo2.0.2022.2.62-100
https://creativecommons.org/licenses/by-sa/4.0/
63
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
should be found. We thus propose a combination of two approaches to the 
creation of problem-labeled pedagogical corpora of Dutch, Estonian, Slovene 
and Brazilian Portuguese: the use of games with a purpose and of crowd-
sourcing for the task. We conducted initial experiments to establish the suit-
ability of the crowdsourcing task, and used the lessons learned to design the 
Crowdsourcing for Language Learning (CrowLL) game in which players identify 
problematic sentences, classify them, and indicate problematic excerpts. The 
focus of this paper is on data preparation, given the crucial role that such a 
stage plays in any crowdsourcing project dealing with the creation of language 
learning resources. We present the methodology for data preparation, offering 
a detailed presentation of source corpora selection, pedagogically oriented 
GDEX configurations, and the creation of lemma lists, with a special focus on 
common and language-dependent decisions. Finally, we offer a discussion of 
the challenges that emerged and the solutions that have been implemented 
so far. 
Keywords: crowdsourcing, game with a purpose, example sentences, peda-
gogical corpus
1 Introduction
Evidence of authentic language use is fundamental for language learn-
ing. One way to access this evidence is through the use of examples 
from corpora, i.e., large collections of texts produced in natural con-
texts, saved in electronic form. However, these corpora may include 
sensitive content or offensive language, in addition to exhibiting struc-
tural problems. While such use is unquestionably authentic, some 
teachers or material developers might consider it to be inappropriate 
for their needs, thus finding it necessary to manually filter the corpus 
before applying authentic examples to pedagogical contexts, which is 
a laborious task. 
To facilitate and stimulate the use of corpora in education we pro-
pose creating problem-labeled pedagogical corpora. This way, the pro-
cess of example selection could be significantly streamlined. At the 
same time, instead of deleting potentially problematic content from 
the corpus we will label it, thus leaving the choice of the use of certain 
64
Slovenščina 2.0, 2022 (2) | Articles
examples dependent on the needs and contexts of use of teachers and 
didactic material developers. The types of problems to be labeled are: 
vulgar, offensive, sensitive content, grammar/spelling problems, in-
comprehensible/lack of context. 
Creating such corpora is challenging due to at least three reasons. 
Firstly, the process of labeling sentences in corpora is extremely time-
consuming, if done manually. Secondly, automatic labeling can also 
be demanding given the polysemic nature of words. Thirdly, sensitivity 
and offensiveness are rather subjective concepts. Our proposal is thus 
to use the help from the crowd to achieve this task. For that, we are 
currently developing CrowLL – Crowdsourcing for Language Learning,1 
a multi-mode, multi-language (Dutch, Estonian, Slovene, and Portu-
guese) digital game. In this game, the players will be offered two exam-
ples (automatically extracted from existing corpora) and prompted to 
choose one (or both, or even none) that they consider to be appropriate 
for language teaching purposes. They will be asked to categorize the 
problem(s) of the example that has not been chosen and point out the 
constituent parts of the sentence that they consider to be problematic. 
With the output obtained from the players, we will compile problem-la-
beled pedagogical corpora for the languages mentioned above. These 
corpora can be used for the development of auxiliary language learn-
ing resources, such as Sketch Engine for Language Learning − SKELL 
(Baisa and Suchomel, 2014),2 dictionaries and teaching materials; and, 
within Natural Language Processing, for the creation of datasets aimed 
at training machine learning algorithms for the compilation of larger 
pedagogical corpora.
Data preparation plays a crucial role in any crowdsourcing project 
that deals with the creation of language learning resources. Indeed, 
the quality and structure of the input data, together with the type of 
1 The research group carrying out the Crowdsourcing Corpus Filtering for Pedagogical Purpos-
es project, within which the Crowdsourcing for Language Learning (CrowLL) game is being 
developed, originated under the umbrella of the European Network for Combining Language 
Learning with Crowdsourcing Techniques (enetCollect) COST Action (CA 16105). It is currently 
composed of seven members from six countries (Brazil, Estonia, Israel, Netherlands, Slovenia, 
and Portugal) and encompasses four languages (Dutch, Estonian, Slovene, and Portuguese). 
See https://ucpages.uc.pt/celga-iltec/crowll/ for further information on the project.
2 SKELL is a free language learning tool that provides automatic summaries of corpus data, 
namely, examples, collocations and thesaurus. Available at https://skell.sketchengine.eu 
(30. 8. 2022).
65
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
task, have a direct impact on the quality of the output. Consequently, 
our research question in this paper is: What is the methodology of data 
preparation that is required to attend to the needs of a crowdsourc-
ing game dealing with identification of offensive language, sensitive 
content and structural problems in authentic language material? We 
present the steps taken, the decisions made, the challenges faced and 
the solutions found to create the methodology for preparing a data-
set of 10,000 sentences per language to develop and internally test 
the CrowLL game. For that, we use three key elements: source cor-
pora, from where the sentences to be labeled by the players will be 
extracted; Good Dictionary Examples – GDEX (Kilgarriff et al., 2008) 
configurations, which automatically identify more pedagogically-suited 
examples in the source corpora and assign scores to the sentences; 
and lemma lists, which define the sentences to be extracted from the 
corpora. After the game is developed and tested with real users, the 
methodology of data preparation itself can also be evaluated.
The paper builds on our previous work within the enetCollect COST 
action.3 We have previously established the motivation for a gamified 
approach to the labeling of examples in pedagogical corpora. We have 
developed the idea, formulated research questions, conducted initial 
tests with the crowd to establish the suitability of the crowdsourc-
ing task, and used the lessons learned to design both the game flow 
and a work plan for the implementation. We have presented different 
stages of this work at conferences, as available in Kuhn et al. (2021) 
and Zviel-Girshin et al. (2021). In this paper, we focus on the newest 
development, namely on the first stage of the game preparation that 
primarily addresses issues related to the (corpus) data needed for the 
game. While the paper builds upon our previous work, it also presents a 
new, summative view and describes various applicative methodologi-
cal decisions that were tested on different languages to ensure further 
usability of our proposed model, both by other languages and for pur-
poses other than the CrowLL game development. 
This paper is structured as follows. Section 2 reviews different ap-
proaches to the identification of good examples for the creation of ped-
agogical corpora. Section 3 introduces crowdsourcing and gamification, 
3 https://www.cost.eu/actions/CA16105/ (28. 10. 2022)
66
Slovenščina 2.0, 2022 (2) | Articles
specifically within the context of language learning. Section 4 presents 
the CrowLL game, firstly reporting on our previous crowdsourcing ex-
periment, whose results have led to the adoption of the Games with a 
Purpose (von Ahn, 2006) approach. Section 5 describes the methodol-
ogy for data preparation in detail, and Section 6 analyzes and discusses 
the results.
2 Pedagogical corpora and language examples
Text corpora are collections of authentic (written or spoken) texts in 
electronic form, sampled to represent a specific type of language use 
(e.g. Gries, 2009; Sinclair, 2005). Corpus texts are typically equipped 
with metadata and linguistic information on different levels, increas-
ing their value for different purposes in applied linguistics, natural 
language processing, and other fields that benefit from analyzing lan-
guage data. In this paper, we focus on the field of language educa-
tion, where the importance and value of corpora have been firmly es-
tablished (Boulton, 2017; Callies, 2019; Römer, 2009; Vyatkina and 
Boulton, 2017). Corpora can be used by researchers and teachers 
for the creation of teaching and testing materials, language resources 
(such as learners’ dictionaries), or directly by students, as classroom 
work with authentic language facilitates bottom-up language learning 
(Osborne, 2002).
It has been established (e.g. Callies, 2019) that direct use of cor-
pora for teaching purposes is still not very widespread for a series of 
reasons, among which is skepticism about the quality and appropriate-
ness of the data, especially because corpora are usually compiled for 
carrying out research, not for language teaching. Attempts to address 
this problem and promote the use of corpora for teaching have led to 
the emergence of specialized pedagogical corpora, i.e., corpora pre-
pared specifically for language learning purposes (Chambers, 2016, p. 
364). One of the main characteristics of a pedagogical corpus is the 
need for “pedagogic mediation” (Braun, 2005), which takes into con-
sideration a set of factors related to the learners and the learning con-
text. For purposes of good example selection, for instance, we argue 
that one type of monitoring could focus on identification of possible 
67
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
structural (grammar and spelling) problems as well as sensitive/offen-
sive content, which might be problematic when presented to learners 
without the mediation of the teacher.
The creation of pedagogical corpora is a costly and time-consuming 
endeavor; however, the process can be supported by the automatiza-
tion of certain procedures. One possible approach is to clean elements 
considered to be problematic for pedagogical purposes from existing 
corpora, such as offensive words and structural errors (misspellings, 
grammar errors). 
In reference to the former, one area that has invested extensively in 
the identification of offensive language is natural language processing 
(NLP), mainly with research on the automatic detection of hate speech, 
with the aim to contribute to monitoring abusive behavior on the inter-
net (e.g., social media, comments on media channels). Some exam-
ples of efforts on this topic are specific evaluation tasks at SemEval 
(International Workshop on Semantic Evaluation),4 such as OffensEval5 
(Zampieri et  al., 2019; Zampieri et  al., 2020), and the Workshop on 
Online Abuse and Harms (WOAH),6 currently in its 6th edition (2022). 
An impressive amount of research on the subject has been carried out 
in NLP, as can be seen, for example, in Poletto et al. (2020). This survey 
presents an up-to-date, systematic review of the available resources 
on hate speech, with detailed analysis, some of the current weakness-
es, and goals for improvement. According to the authors, it is a com-
plement to previous surveys, in particular, Lucas (2014), Wiegand and 
Schmidt (2017), and Fontana and Nunes (2018) (Poletto et al., 2020, 
p. 479). Datasets, such as the ones available on the dedicated web-
page Hate Speech Dataset Catalogue (Vidgen and Derczynski, 2020),7 
and lexica, such as HurtLex (Bassignana et al., 2018), are some of the 
resources developed in NLP that could be used as a source of keywords 
for corpus cleaning. This approach consists of using blacklists contain-
ing swear words, vulgarisms, and words related to sensitive content in 
order to remove from the corpus sentences where these words occur 
(see below for a combined use of blacklists and GDEX). That means 
4 https://semeval.github.io/ (28. 10. 2022)
5 https://sites.google.com/site/offensevalsharedtask/home (28. 10. 2022)
6 https://www.workshopononlineabuse.com (28. 10. 2022)
7 https://hatespeechdata.com/ (28. 10. 2022)
68
Slovenščina 2.0, 2022 (2) | Articles
the “clean” corpus would not contain any sentences with those words. 
Another contribution from NLP to corpus cleaning would be through 
the application of offensive identification models at the sentence level, 
thus eliminating from the source corpus sentences automatically iden-
tified as offensive. However, one of the challenges in computational ap-
proaches to this subject is that other aspects, above and beyond the 
linguistic surface, have a crucial influence in the determination of what 
offensiveness is. Schmidt and Wiegand (2017) present a few works that 
seek to incorporate context to hate speech detection, but acknowledge 
that in certain difficult cases the method fails, so more investigation is 
needed. Relatedly, Poletto et al. (2020) point out a shortcoming of not 
considering the pragmatic aspects of swearing when evaluating hate 
speech – the production of false positives. 
Whatever perspective is adopted with regard to identifying offen-
siveness, either at a word or sentence level, we have argued (Kuhn 
et  al., 2021) that the total elimination of sentences from the corpus 
should be avoided because: 1. very few words are problematic in all 
of their senses and contexts, and 2. teachers and didactic material de-
velopers should be free to use whatever examples they find useful for 
their various needs. We thus propose to label potentially problematic 
data in pedagogical corpora instead of removing it.
For structural errors, automatic error detection (following differ-
ent methods), has been widely adopted. For instance, Reynaert (2006) 
adopts a corpus-induced corpus clean-up approach to detect typos in 
texts. Rather than dictionaries, the lexicon used in the clean-up process 
consists of typos found in large corpora. However, Xu and Chamberlain 
(2020) have shown that some problems identified as structural errors 
by automatic error detection methods might not be actual mistakes, 
but rather spelling and grammatical variations based on the context of 
use. They argue that humans are still required to perform the clean-up 
task, and thus developed a game (Cipher) in which players are asked to 
identify different types of errors in texts and annotate them.
A more lexically-oriented approach to the compilation of pedagog-
ical corpora refers to the adoption of sophisticated methods that auto-
matically analyze texts according to several criteria to identify good ex-
amples. These good examples can then be gathered in a pedagogical 
69
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
corpus. The current state-of-the-art in corpus linguistics is Good Dic-
tionary Examples (GDEX) (Kilgarriff et al., 2008), available as a feature 
in the Sketch Engine (Kilgarriff et al., 2004, 2014) corpus query system. 
The general idea of GDEX is to provide a list of suitable, good-quality 
candidate corpus sentences that lexicographers can directly add into 
the dictionary as illustrative examples. At the heart of GDEX is a rule-
based formula that assigns a numerical score to each corpus sentence 
based on how well it meets the pre-defined criteria. The criteria can de-
termine, for instance, the length of the sentence, the number of words 
in the sentence, the frequency of word forms or lemmas in the corpus, 
the presence or absence of certain elements in the sentence, and so 
on. The scoring formula (with additional parameters) constitutes a so-
called GDEX configuration. There are two groups of classifiers used in 
the configuration: hard and soft. Hard classifiers include a very high 
penalty giving sentences a very low score, resulting in pushing them 
to the bottom of the candidate list. Soft classifiers either penalize sen-
tences or award bonus points, helping to rank good dictionary example 
candidates. As a result, GDEX lists all example candidates in descend-
ing order and can also be used to filter out all sentences below a certain 
threshold (Kosem et al., 2019).
A GDEX-based methodology has already been used to create 
pedagogical SKELL (Sketch Engine for Language Learning) corpora 
for Russian, Estonian (Koppel, Kallas, et al., 2019), English, German, 
Italian and Czech. This entails filtering a source corpus with a GDEX 
configuration, leaving only the sentences that meet all the criteria of 
good dictionary examples and removing the rest. But creating corpora 
by eliminating data brings out the shortcomings we mention earlier 
in this paper. The English noun ass, for example, can refer either to a 
body part, a donkey or a stupid/annoying person.8 Since in some in-
stances it may be considered problematic, it might be added to the 
blacklist. In that case, all sentences containing the word ass are re-
moved from the corpus regardless of the word’s meaning. This is not 
ideal for either lexicographers, who want to illustrate all the mean-
ings of a word in a dictionary, or teachers, who should be given the 
choice to decide what they want to use for teaching, considering the 
8 https://www.macmillandictionary.com/dictionary/british/ass (30. 8. 2022).
70
Slovenščina 2.0, 2022 (2) | Articles
students’ characteristics, such as level, age, and background and rel-
evance to the course topic.
Building on GDEX, Stanković et al. (2019) adopted machine learn-
ing to identify good candidate examples for Serbian. First, they ana-
lyzed lexical and syntactic features in a corpus compiled of illustrative 
examples from the five digitized volumes of the Serbian Academy of 
Sciences and Arts (SASA) dictionary. They then identified 14 features 
relevant for the task (character-based, token-based and syntactic fea-
tures) and prepared a gold dataset of good examples. Sentences from 
the prepared dataset, represented as feature-vectors, were used for a 
supervised machine learning model, which was then used in a GDEX 
classifier for contemporary Serbian sentences. A decision-tree classi-
fier trained on the data predicted whether a certain corpus sentence 
is a good candidate for an illustrative example for the given dictionary 
headword or not, with an accuracy of 83% for both positive and nega-
tive samples (Šandrih, 2020).
Another tool to automatically identify good examples based on a 
series of criteria and using both rule-based and machine-learning ap-
proaches is HitEx. The combined approach was designed to assess the 
readability and suitability of (initially coursebook) material for teach-
ing Swedish as L2 (Pilán et  al., 2013, 2014; Pilán et  al., 2016). For 
this task, 61 features of different types were used: length-based (e.g. 
number of tokens and characters), lexical (e.g. CEFR9-annotated word-
lists), morphological (e.g. part-of-speech), syntactic (dependency re-
lation tags), and semantic features (e.g. number of senses of a spe-
cific word). Candidate sentences were first ranked according to these 
features, and the 100 highest-ranked sentences were given to the 
machine-learning model for classification. The sentences were clas-
sified according to their proposed suitability for students at a certain 
CEFR level, and returned in the order of their heuristic ranking. Using 
the complete feature set at the document level, the tool obtained 81% 
accuracy, however, the classification accuracy for sentences was only 
63.4%, presumably because the amount of context was too limited for 
the features to capture differences between the sentences.
9 Council of Europe: Common European Framework of Reference for Languages: Learning, 
Teaching, Assessment. Cambridge University Press (2001).
71
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Taken together, it can be concluded that the creation of peda-
gogical corpora can be challenging in at least two ways: 1. manually 
monitoring large amounts of texts is extremely time-consuming, and 
consequently, expensive; and 2. automatization of processes to sup-
port compilation has limitations due to the very nature of language. 
As mentioned above, one of the main shortcomings of rule-based 
approaches to automatic corpus cleaning, such as the method used 
for the development of SKELL corpora, lies in the fact that many of 
the words in the blacklists used as a reference to exclude sentences 
from a corpus are polysemic. Moreover, the automatic identification of 
structural problems does not take into consideration language varia-
tion. Finally, the NLP field has acknowledged that further investigation 
and development are needed in order to include contextual aspects 
to automatic offensiveness identification, with current methods still 
falling short.
As a result, human verification of sentences is required. More im-
portantly, from our perspective, pedagogical corpora should be labeled 
for potentially problematic content rather than cleaned from it. In order 
to streamline the verification of the sentences for the creation of prob-
lem-labeled pedagogical corpora, we have decided to ask the crowd 
for help. It was in this context that the Crowdsourcing Corpus Filtering 
for Pedagogical Purposes project was created.
3 Crowdsourcing and gamification
Crowdsourcing is a technique for gathering data or performing large-
scale tasks which is often based on the framework of collective intel-
ligence (Lévy, 1997). Concepts related to crowdsourcing include co-
creation, open innovation, and user innovation (Chesbrough, 2006; 
Prahalad and Ramaswamy, 2000; Von Hippel and Katz, 2003). The 
benefits of crowdsourcing have been thoroughly established (Aitamur-
to et al., 2011; Buecheler et al., 2010; Lew, 2014; Morschheuser et al., 
2017; von Ahn and Dabbish, 2008), and success stories can be found in 
various fields, from astronomy (e.g. Zooniverse; Simpson et al., 2014) 
to business. Language-related use of crowdsourcing is found in NLP 
(e.g. for tasks such as named entity recognition and entity linking), but 
72
Slovenščina 2.0, 2022 (2) | Articles
also in fields such as lexicography (e.g. Arhar Holdt et al., 2018; Kosem 
et al., 2018) and more recently in language learning.
The role of crowdsourcing and its potential in language education 
has been investigated by enetCollect (the European Network for Com-
bining Language Learning and Crowdsourcing Techniques), a large Eu-
ropean network project funded as a COST action. The action addressed 
the pan-European challenge of fostering the language skills of all citi-
zens regardless of their social, educational, and linguistic backgrounds. 
Its focus was on exploring the possibilities of how to use crowdsourc-
ing to enhance the production of learning materials to cope with both 
the increase in demand for learning a second language (for migration, 
business, and tourism purposes), and the demand for more accessible 
materials in the many languages that are of interest to learners.
As the enetCollect research has confirmed, combining crowd-
sourcing and language learning is not a new undertaking, and it is 
possible to merge them to mass-produce language resources for 
any language in which a crowd of language learners can be involved 
(Arhar Holdt et al., 2021; Bédi et al., 2019; Lyding et al., 2018; Nico-
las et al., 2020). Several language learning portals based on crowd-
sourcing have gathered huge multilingual audiences. Although this 
paper is not the platform for a detailed presentation of any of these 
portals, we offer some data to provide an insight into the scale of 
the crowd they were able to reach between 2017–2018 (Gorovaia, 
2018). Rosetta Stone, the oldest of the portals and founded in 1992, 
attracted 75,720,000 users. Babbel, which opened in 2007, gathered 
20,000,000 users. Mango Languages, launched in 2007, attracted 
300,000 users. LiveMocha, which began in 2007, had 12,000,000 
users in 2016. Busuu, which started in 2008, reached an audience 
of 70,000,000, while Duolingo, launched in 2011, had 300,000,000. 
Duolingo is notable for having built one of the world’s most popu-
lar language-learning apps while hiring only a handful of language 
experts. Each day, it provides millions of sentence examples and 
exercises to users, almost all of them created by its 300 million or 
so volunteers. All of these portals are educational business entities, 
which confirms that educational businesses are able to attract users. 
The content they provide may facilitate and improve teaching, and 
73
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
crowdsourcing may be used to help to create resources for additional 
educational areas or new languages.
An important aspect of crowdsourcing is crowdsourcer motivation, 
i.e. finding the best method for a specific crowdsourcing task that will 
attract enough people and ensure their participation until the end of 
the task. Lew (2014) states there are three types of motivation: psy-
chological, social, and economic. Psychological motivation is driven 
by the expectation that participants will find the task psychologically 
satisfying or personally fulfilling. Social motivation relies on the desire 
of individuals to interact with others who share similar interests, con-
tribute to the community, or improve a certain skill. Economic moti-
vation involves financial benefits for the participants who can, for ex-
ample, receive micropayments for successfully completed tasks (see 
 Rumshisky, 2011).
A method that relies heavily on the psychological motivation of the 
participants, and aims to make completing the task pleasurable, is a 
game with a purpose (GWAP). GWAPs are “games that are fun to play 
and at the same time collect useful data for tasks that computers can-
not yet perform” (Hacker & von Ahn, 2009, p. 1208). They have been 
increasingly used to crowdsource data to create lexical infrastructures 
of different types, and examples of GWAP include Dodiom (Eryiğit et al., 
2022), Jeux de Mots (Lafourcade, 2007), Phrase Detective (Chamber-
lain et al., 2008), ZombiLingo (Guillaume et al., 2016), Jinx (Seemakur-
ty et al., 2010), Game of Words (Arhar Holdt et al., 2020), and Cipher 
(Xu and Chamberlain, 2020). 
In sum, when applied in the right circumstances, to the right 
crowd, and using a method and motivation best suited for a specific 
task, crowdsourcing can deliver very useful outcomes. It is, however, 
important to note that successful completion of a crowdsourcing task 
also requires a careful analysis of the related goals, the problem-solv-
ing environment, the expertise required, complementary activities and 
capabilities, and the competitive environment (Aitamurto et al., 2011; 
Morschheuser et al., 2017; Pe-Than et al., 2015).10
10 There is evidence that crowdsourcing tasks are sometimes not well-defined, or are given to 
the “wrong” unskilled/untrained crowd that cannot complete the task.
74
Slovenščina 2.0, 2022 (2) | Articles
4 The crowdsourcing for language learning game − 
CrowLL
4.1 Background
In 2019 we carried out an experiment on the use of crowdsourcing for 
corpus filtering in which we asked the crowd to identify offensive sen-
tences for pedagogical purposes (Kuhn et al., 2021). The sentences to 
be judged were automatically extracted from corpora of Brazilian Por-
tuguese, Dutch, Serbian, and Slovene, and the participants were from 
Brazil, Netherlands, Serbia, and Slovenia, respectively. This study has 
revealed that the crowd considered to be offensive sentences which, al-
though not directly formulated as such, expressed misogyny, religiously-
offensive content, violence towards children, or contained topics related 
to war and politics. The study has also shown that sentences with explic-
itly rude content were not necessarily considered to be inappropriate. 
These revealing results support our understanding that offensive-
ness and sensitivity are subjective and that their expression through 
language involves mechanisms that go beyond the explicit use of 
swear words. The findings of the experiment have also indicated that 
crowdsourcing seems to be an adequate technique to deal with such 
a contentious topic. Nevertheless, the traditional approach used in the 
experiment, namely, via the Pybossa crowdsourcing platform,11 was 
considered to be rather unappealing by the participants, and thus we 
decided to experiment with the Games with a Purpose approach. This 
has also been adopted to address a similar topic by High School Super 
Hero (Bonetti and Tonelli, 2020, 2021), a game currently under devel-
opment that focuses on the linguistic annotation of abusive language 
to collect data for hate speech detection. However, while GWAPs have 
been used for various purposes in different fields (cf. section 3), the 
use of games to monitor offensiveness and sensitive content in authen-
tic examples is still in its infancy. 
One additional point should be made. Given that some participants 
in our experiment considered sentences with structural problems in-
appropriate for language learning, we decided to include this type of 
problem in the game, in addition to offensiveness and sensitive content.
11 https://pybossa.com (30. 8. 2022)
75
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
4.2 CrowLL
The Crowdsourcing for Language Learning (CrowLL) game is under de-
velopment for Brazilian Portuguese,12 Dutch, Estonian, and Slovene. 
The idea for CrowLL was originally inspired by the Matchin game (Hack-
er and von Ahn, 2009). In this, two players compete with each other to 
guess which of the two pictures that are shown to them their opponent 
will choose. If their predictions match, they score points. According to 
Hacker and von Ahn (2009), this game mechanism can be used to elicit 
user preferences. Harris (2014) has also shown that asking about the 
partner’s opinion leads to better results with regard to both parties giv-
ing the same answers than when the players make decisions based on 
their own opinions. Given that our interest in the game is to find out 
what examples players consider to be offensive, have sensitive con-
tent or have structural problems, this in fact includes asking players 
to make judgements that can vary from one person to another. Thus, 
the selection of a game mechanism that elicits the users’ opinions and 
preferences seems to be a viable solution. 
Nevertheless, we have also opted to offer a single-player mode. Al-
though with this mode, the game might not benefit from the advantag-
es put forth by the dual-player mode, the organizational factors have 
led us to opt to start with the development of the solo mode. Namely, 
the computational implementation of the solo mode requires less time 
and is, consequently, less expensive. 
In terms of the type of crowdsourced work, Morschheuser et  al. 
(2017) propose a categorization of crowdsourcing types based on the 
framework presented by Geiger and Schader (2014). Based on this, 
we consider CrowLL as a crowdrating game, given that “crowdrating 
systems commonly seek to harness the so-called wisdom of crowds 
(Surowiecki, 2005) to perform collective assessments or predictions. 
In this case, the emergent value arises from a huge number of homo-
geneous ‘votes’” (Morschheuser et al., 2017, p. 27). 
With CrowLL, the definition of whether a sentence is problematic or 
not, to which category of problem it belongs, and what constituent part 
of the sentence is problematic will emerge from the majority consensus.
12 European Portuguese will be included later.
76
Slovenščina 2.0, 2022 (2) | Articles
CrowLL will be a collaborative game with three levels. In level 1 
(I’m curious!), players identify appropriate sentences for language 
teaching (Figure 1). In level 2 (I’m eager to help!), they categorize the 
sentences that have not been chosen (i.e., considered to be inappropri-
ate), ranging from grammar/spelling problems to issues of offensive-
ness and sensitivity (Figure 2). In level 3 (I’m feeling enthusiastic!), 
players mark in the sentence what they consider to be problematic. 
Players can choose to play the full game cycle (all levels), a combina-
tion of two levels, or only one level.
Figure 1: Levels 1 and 2 of CrowLL.
Initially, the dual-player mode should involve two human play-
ers. However, ‘the cold-start problem’, i.e., the lack of an opponent to 
start a game (Dulačka et al., 2012 as cited in Pe-than et al., 2015) has 
made us think of alternatives. Indeed, it can be a challenge to find a 
playing partner at any given time, especially in the case of small lan-
guage communities such as some of those for which this game is being 
developed. Therefore, we propose two solutions, a synchronous and 
an asynchronous mode. In the synchronous mode, players will play 
against bots with pre-recorded answers. Players are rewarded when 
their predictions match the pre-recorded answers. With the asynchro-
nous approach, we will offer delay mechanics (Pe-Than et al., 2015). 
Here, players will choose packages containing sentences previously 
judged by others, and players will be rewarded once their answers are 
confirmed by others at a later time. Depending on whether the game is 
77
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
played in single- or dual-player mode, some of the questions will have 
to be changed and the scoring will also be different. 
We have several ideas with regard to incentive through scoring 
mechanisms, ranging from offering an individual score that stems from 
consecutive work, to keeping a record of a cooperative score that shows 
the agreement of the player in teams/partnerships (so-called normali-
zation motivation, according to Preist et al., 2014), including displaying 
scoreboards of the player’s country’s ranking position in comparison to 
the other countries (Olympic Games style). In this way, the game can 
be competitive on an individual level, while at the same time coopera-
tive on the team level.
5 Methodology of data preparation
In order to start testing the game so that adjustments and develop-
ment can be made before the official public release, we have decided 
to create an initial dataset of 10,000 sentences per language. The 
data extraction procedure involves – from each of the source corpora 
– the use of GDEX and a lemma list to extract the sentences. How-
ever, before proceeding with the extraction, a series of actions are 
required:
1. Definition of the source corpora from which sentences will be 
extracted;
2. Provision of pedagogically oriented GDEX configurations;
3. Creation of lemma lists to extract sentences from the corpora.
Next, we will explain each action in more detail.
5.1 Source corpora
One of the crucial guidelines for choosing our source corpora was that 
they were at least in some part openly available. This way, the resulting 
labeled datasets can be shared with and used by others. This decision 
aims at contributing to overcome one of the main problems in the area 
of language resource development, namely the lack of open-source 
data for many languages, as noted, for example, by Vajjala (2022) with 
regard to research on automatic readability assessment. 
78
Slovenščina 2.0, 2022 (2) | Articles
For Dutch and Brazilian Portuguese, we use the respective corpora 
of the Timestamped JSI web corpus, which is a family of web corpora 
created from IJS newsfeed by the Jozef Stefan Institute, in Slovenia, 
for 18 languages (Trampuš and Novak, 2012). Corpora in this family 
comprise news articles continuously crawled from RSS feeds. Both 
corpora are available in Sketch Engine. The Dutch corpus covers texts 
originating from the Netherlands and Belgium from 2014 to 2021. The 
whole corpus, totaling approximately 1.3 billion words, will be used. 
The Portuguese corpus covers texts from 2014 to 2021, published 
online in different countries, totaling over 4.5 billion words. As we are 
first developing CrowLL for Brazilian Portuguese, we only used texts 
marked with Brazil as a source country, thus making a subcorpus of 
3,202,820,993 words.
For Estonian we use the Estonian National Corpus 2021 (Koppel 
and Kallas, 2022), which is the latest and largest corpus of written texts 
of modern Estonian. The texts span the period from 1990 to 2021. The 
most extensive part of the Estonian National Corpus 2021 is the Esto-
nian Web Corpora, i.e. texts crawled from the web. It contains eleven 
sub-corpora (i.e. Web 2013, Web 2017, Web 2019, Web 2021, Feeds 
2014-2021, Wikipedia 2021, Wikipedia Talk 2017, the Open Access 
Journals (DOAJ), Literature, Balanced Corpus, and the Reference Cor-
pus) totaling 2.3 billion words.
For Slovene we use Gigafida 2.0 (Krek et al., 2020), the most recent 
version of the reference written corpus of Slovene. It contains 38,310 
texts and 1,134,693,333 words. The texts span the period from 1991 
to 2018, and cover newspapers, internet resources (the texts collected 
using the IJS Newsfeed service; Trampuš and Novak, 2012), maga-
zines, fiction, non-fiction (such as textbooks), and various other texts. 
Newspaper texts represent nearly half of the corpus (47.8% of tokens), 
followed by internet texts (28%) and magazines (16,5%).
5.2 Pedagogically oriented GDEX configurations
In section 2, we introduced GDEX (Good Dictionary Examples) (Kil-
garriff et al., 2008). While the Sketch Engine team has made gener-
al GDEX configurations for a number of languages available on their 
79
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
platform, GDEX configurations can be specially devised to better fit 
specific purposes, depending on the objectives of the project at hand. 
As the objective of the CrowLL game is to have the crowd help to create 
problem-labeled corpora for language learning, the sentences to be 
presented to the crowd for labeling have to be previously prepared to 
fit the pedagogical purpose. In order to do this automatically, we have 
opted to use pedagogically oriented GDEX configurations.13 Slovene 
and Estonian have adopted configurations that have been previously 
devised for pedagogical purposes, while Dutch and Portuguese have 
built on existing pedagogically oriented configurations. 
The Slovene GDEX configuration was originally devised for lexico-
graphic projects at the Centre for Language Resources and Technolo-
gies, and more specifically this includes the Slovene Lexical Database 
and Collocations Dictionary of Modern Slovene (Gantar et  al., 2016; 
Kosem et al., 2011; Kosem et al., 2012; Kosem et al., 2013). The ini-
tial lexicographically oriented GDEX configuration was also used for 
pedagogical purposes, i.e. in the preparation of examples for exercises 
in the Pedagogical Corpus Grammar (Arhar Holdt et al., 2011; Arhar 
Holdt et al., 2017).
The Estonian configuration was originally devised for extracting 
examples for the Estonian Collocations Dictionary (Kallas et al., 2015) 
aimed at learners of Estonian as a foreign language on the B2-C1 level. 
The configuration was later used to create a corpus – the etSkELL corpus 
– that only includes sentences that meet all the pre-defined criteria (i.e. 
have a GDEX score above 0.5). The etSkELL corpus is now also used as 
a source corpus in the Estonian SKELL, as well as in the language portal 
Sõnaveeb for presenting the users a set of authentic corpus examples 
(Koppel, Kallas et al., 2019; Koppel, Tavast et al., 2019; Koppel, 2020).
For Dutch, special GDEX configurations were developed in the con-
text of the project Woordcombinaties14 (Word combinations) which is 
13 While we are aware that some fields of the NLP area are devoted to related issues that could 
potentially contribute to the automatic identification of pedagogical sentences or even to 
enhancing GDEX configurations, such as automatic normalization, automatic error detec-
tion, and readability assessment, a decision was made to adopt or adapt existing versions of 
GDEX configurations as a first step towards identifying candidate sentences for pedagogical 
purposes. Moreover, and relatedly, it is outside the scope of this paper to explore other ap-
proaches to further enhance GDEX configurations. 
14 https://woordcombinaties.ivdnt.org/ 
80
Slovenščina 2.0, 2022 (2) | Articles
targeted at advanced language learners (Colman and Tiberius, 2018). 
For this project, a minimal configuration was defined only using the 
classifiers not surrounded by round brackets in Table 1, as well as a 
more restrictive configuration also incorporating the classifiers in be-
tween brackets. Lexicographers in the project Woordcombinaties have 
access to both configurations, and both are being used. For the initial 
dataset for CrowLL a combination of the two configurations will be 
used, to bring the Dutch configuration more in line with the configura-
tions for the other languages.
The GDEX configuration that was devised for academic Portuguese 
in the context of a design of a dictionary for university students (Kuhn, 
2017) is the basis for the development of the configuration for data ex-
traction. Given the pedagogical aspect of the academic configuration, 
adjustments were mostly made according to the characteristics of the 
type of language, i.e., from academic to general language. Additional 
development might take place in the future. 
Out of the four languages, Estonian has carried out a study espe-
cially developed to evaluate its GDEX configuration, while the other 
languages have relied on the successful and extensive use of the con-
figurations by lexicographers and other users. The output of the Es-
tonian GDEX configuration has been assessed by lexicographers and 
L2 learners of Estonian. The two types of annotators performed a task 
to determine whether authentic and unedited corpus sentences would 
be suitable as example sentences for learners’ dictionaries on the B2-
C1 level. The results of the assessment showed that both types of an-
notators considered as many as 85% of the corpus sentences chosen 
by the Estonian GDEX configuration as good examples, confirming the 
premise that the methodology GDEX uses to select the examples is 
reliable (Koppel, 2019). The pre-existing Slovene GDEX configuration 
adopted in our methodology has been widely tested by lexicographers 
and successfully implemented in the development of other resources, 
such as a pedagogical grammar, as noted above. For Dutch, the con-
figuration used is a combination of two configurations that have been 
tested extensively by a team of lexicographers within the Woordcom-
binaties project. The Portuguese GDEX configuration for the game is 
actually the only one that has not been previously tested, as it consists 
81
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
of an adaptation of an existing configuration. However, the configura-
tion that was used as the basis has been carefully devised and used by 
other users (for example, when integrated in the Sketch Engine tool). 
As mentioned in section 2, GDEX configurations consist of two 
types of classifiers: hard and soft. Sentences are evaluated against 
those classifiers and scores are calculated accordingly, based on the 
weighted sum. Hard classifiers serve to severely penalize sentences, 
separating the good from the (really) bad ones. Soft classifiers, on the 
other hand, penalize or give bonuses to the sentences, thus contrib-
uting to ranking qualitatively more similar sentences. For the present 
project, some classifiers are used in all languages, while others are lan-
guage-dependent. Table 1 provides an overview of the classifiers used 
in the configurations of the four languages of the game.
Hard classifiers (in bold in Table 1) mean that the evaluation of 
these features in the sentences weighs heavily on their score. A sen-
tence must start with a capital letter and finish with a period, an excla-
mation mark or a question mark to be considered a whole sentence. 
For pedagogical purposes, it is crucial that only whole sentences are 
extracted from the source corpora. The blacklist – illegal characters 
classifier is used to detect the sentences containing strings with un-
wanted characters such as parts of the program code (<tag>) or URLs 
(//), because such sentences are not wanted in pedagogically oriented 
content. Spam texts are usually machine-generated, and thus are not 
appropriate for language learning. With the blacklist – spam classifier, 
sentences containing words in this blacklist get a very low score. In 
addition to spam texts, other characteristics of texts found on the web 
can be counterproductive for pedagogical purposes, such as the pres-
ence of typos and misspellings. In order to filter those sentences out, 
a minimum frequency for tokens is established. Another aspect to be 
considered in a pedagogical example is its length. Very long sentences 
can compromise intelligibility, i.e., “examples that are intelligible (to 
the users) are those that are not too long and do not contain complex 
syntax or rare or specialized vocabulary” (Kosem et al., 2019, p.120), 
while very short sentences might lack context and lose informative val-
ue (ibid.). Thus, sentences that do not fit between the minimum and 
maximum sentence length values get a high penalty. 
82
Slovenščina 2.0, 2022 (2) | Articles
Table 1: Overview of the classifiers used in pedagogically oriented configurations for Slo-
vene, Dutch, Estonian and Brazilian Portuguese (adapted from Kosem et al., 2019)
Classifier Slovene Dutch Estonian Brazilian
Portuguese
whole sentence X X X X
blacklist - illegal 
characters
X X X X
blacklist - spam X X X
minimum frequency for 
tokens
X (3)  X (20) X (5) X (5)
minimum and maximum 
sentence length
X (7 and 60)  X (<30) X (4 and 20) X (7-30)
graylist – bad words X (X) X X
optimal sentence length X (15-40 
tokens)
X (9-12 
tokens)
X (6-12 
tokens)
X (10-18 
tokens)
penalty for long words X (longer than 
12 characters)
 X (longer than 
12 characters)
penalty for rare 
characters
X X X X
penalty for capital letters X X (part of rare 
characters)
X
penalty for tokens with 
mixed symbols
X X X X
penalty for proper nouns X (X) X  X
penalty for pronouns X X  
penalty for sentence 
initial words
X (list of 
words 
provided)
(X) X  
penalty for sentence 
initial phrase
X  (X) X  
penalty for sentence 
initial tags
 (X) X  
penalty for rare words X (fewer than 
1,000 hits in 
the corpus)
 (X) X (fewer than 
1,000 hits in 
the corpus)
X (fewer than 
500 hits in the 
corpus)
penalty for commas X (3 or more)  X (2 or more) X (2 or more)
penalty for abbreviations (X)
penalty for sentences 
without a finite verb
 X  
penalty for more than 
two occurrences of que 
(that, which)
   X
83
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
As can be seen in Table 1, the use of soft classifiers (in non-bold in 
Table 1) varies among the languages, with optimal sentence length and 
graylist – bad words being used in all of them. Sentences within the 
optimal sentence length get a higher score than the other sentences 
outside this interval, and are thus ranked higher up among all the sen-
tences. Length values vary from language to language, and have been 
defined based on what each language considered to be the optimal 
sentence length interval for pedagogical purposes.15
Words in the graylist – bad words are compared against the sen-
tences in a corpus and, if any word is found, the sentence is penalized. 
Evaluation of the settings has shown that this penalization is enough 
to push such potentially problematic sentences lower down the rank-
ing, but still not too low in case the penalization is unjustified (polyse-
mous words, etc.). This means that sentences with higher scores (in 
the upper part of the list) will probably not contain explicitly offensive 
words, that sentences with very low scores (at the bottom of the list) 
will probably contain offensive words, and that the ones in the middle 
might or might not contain them. While we want the players to assess 
the sentences from the upper and lower parts and possibly confirm 
that they are non-problematic and problematic, respectively, one of 
the most interesting contributions from the players will be the evalu-
ation of sentences pertaining to exactly this grey, middle area, where 
one can expect to find explicitly offensive lemmas, offensive lemmas 
that are polysemous and not being used in an offensive manner, offen-
sive sentences with no overtly offensive lemmas, and sentences with 
sensitive content. This type of evaluation is still not well performed by 
computers, so we need humans to do it.
The Slovenian graylist contains 1,909 words (nouns, adjectives, 
verbs and adverbs) that were identified in several lexicographic and lin-
guistic projects as vulgar or (potentially) offensive. For Portuguese, there 
are two graylists of explicitly offensive and vulgar items (nouns, adjectives 
and verbs), one consisting of lemmas and another one of word forms and 
strings (e.g., fodid.+), totaling 91 items. These lists result from manual 
15 It was observed that different languages differ in the average sentence length due to various 
reasons such as word formation (e.g. compounds in Estonian are mainly written as one word, 
as opposed to two or more words in Slovene), existence of articles etc.
84
Slovenščina 2.0, 2022 (2) | Articles
evaluation and editing of the list of taboo lemmas and word forms creat-
ed by the Sketch Engine team for the default Portuguese GDEX that they 
have devised. Words related to cultural aspects, such as those related to 
religion or nationalities, that were not offensive or vulgar but had prob-
ably been included because of their potential to spark hate speech, were 
discarded. In addition, new offensive or vulgar items were added, but 
further editing can be carried out if necessary. The Estonian graylist con-
tains 1,472 words (nouns, adjectives, verbs), consisting of words tagged 
as vulgar, offensive, colloquial, and slang in the EKI Combined Dictionary 
(Langemets et al., 2022), swear words in foreign languages  (e.g. fuck), 
their adapted variants (e.g. fakk, pohui ‘похуй’), and words written dif-
ferently from the written language norm. The Dutch configuration uses a 
graylist of 93 words which is based on words labeled as vulgar or offen-
sive in the Algemeen Nederlands Woordenboek.16 If needed, the Dutch 
graylist will be further refined in the future.17
Other classifiers relevant in the context of language learning are 
penalties for long words, rare characters, tokens with mixed sym-
bols and capital letters. This is based on the assumption that long-
er words, too many rare characters and capital letters as well as the 
occurrence of non-words have an impact on reading complexity. For 
pedagogical purposes, a penalty can also be given to proper nouns in 
order to give priority to sentences without (or with few) of these, as in 
many cases the named entities in those sentences might not be known 
to the learners. The same applies to abbreviations which learners may 
not necessarily be familiar with. Penalizing pronouns can also help, as 
sentences with many pronouns are often too anaphoric and lack con-
text for proper understanding.
16 https://anw.ivdnt.org/search (30. 8. 2022). Note the ANW is a dictionary under construction, 
and thus new words (including words labelled as vulgar or offensive) are continuously being 
added. The current GDEX configuration for Dutch uses the words labelled as vulgar or offen-
sive in the ANW at the time the GDEX configuration was defined for the project Woordcombi-
naties.
17 As can be noticed, there is a considerable difference between the number of lemmas in the 
graylists for different languages. More thorough studies on problematic vocabulary were 
conducted for Slovenian and Estonian, and more extensive word lists were obtained as a 
result. It should be noted that these graylists contain lemmas that are problematic only in 
part, e.g. in one of their senses. Consequently, the penalization of sentence(s) containing the 
word(s) is milder. Using different approaches to graylists will open possibilities to compare 
them at the end of the study.
85
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Another type of classifier uses lists containing words and phras-
es that should not occur in a sentence-initial position. These words 
and phrases are heavily penalized because in previous manual evalu-
ations of extracted sentences for Slovene, Estonian and Dutch, several 
sentence-initial words and phrases were identified that are a good 
signal that the sentence is contextually dependent on the previous 
sentence(s), and is thus less suitable to be used as a standalone com-
ponent for pedagogical purposes. Similarly, certain sentence-initial 
tags can be penalized, e.g. conjunctions, because sentences starting 
with conjunctions are often anaphoric. 
Furthermore, sentences containing less frequent words tend to be 
considered inadequate to serve as examples of language use in peda-
gogical contexts, as such words are likely not known to the learners 
and might act as a distraction. The penalty for rare words classifier 
penalizes sentences with words whose frequency is below a certain 
threshold, so these sentences get lower scores. The use of too many 
commas in a sentence might be indicative of complexity, so the  penalty 
for commas is a classifier that penalizes sentences if they have more 
than a defined number. A penalty for sentences without a finite verb 
can help to filter out less typical sentences. The grammar of the Esto-
nian language (Erelt and Metslang, 2017), for instance, states that a 
typical sentence contains a finite verb and phrases (collocations) that 
go with the verb.
Portuguese adopts a separate penalty for more than two occur-
rences of que (that, which). This classifier has been created to avoid 
sentences with too many subordinate or relative clauses, because high 
syntactic complexity makes understanding more difficult, which is 
something to be avoided in pedagogical examples.
5.3 Lemma lists
To ensure at least partial comparability of the multilingual results, we 
decided to extract the data using lemmata, comparable across the 
participating languages. For this purpose, we first prepared a list of 
100 words in English using the criteria described below. In the second 
step, we translated the list to Slovene, Brazilian Portuguese, Dutch, and 
86
Slovenščina 2.0, 2022 (2) | Articles
Estonian, reporting on problems with translation equivalents, as well 
as their frequency in the corresponding source corpora. We discuss 
some of these issues in Section 6.
We wanted to include lemmata that were of different relevance for 
labeling in the context of the CrowLL task: (a) words that were clearly 
(on the surface and in the vast majority of the meanings) offensive or 
vulgar, for example: nigger, whore, bitch, retarded, to fuck, to piss; (b) 
words that were offensive or vulgar in some of the meanings, as well 
as words with potentially sensitive content, for example: cow, drunk, 
suicide, fanatic, depressed, to molest; (c) words that would typically 
not be considered offensive, vulgar or sensitive from the perspective of 
our labeling task, for example year, world, service, new, to say, to see. 
Vocabulary from the first group would typically make it to blacklists, 
and thus a blacklist-based methodology would automatically filter out 
corpus occurrences with these words before they would be included in 
any teaching material. Here, we are including it to test the hypothesis 
that these corpus occurrences would also be marked as inappropriate 
by the crowd. On the other hand, non-problematic words are included 
to test the complementary premise. The most interesting for our task, 
however, are words in group (b). The lemmata list thus includes 20 
words from groups (a) and (c) and 60 words from group (b).
The seed lemmata were selected using the translation into English 
of a list of words that were identified during the creation of a GWAP 
called Game of Words (Arhar Holdt et al., 2021). This game prompts 
the players to provide synonyms and collocations for different Slovene 
words, with the implicit purpose to clean the noise from two automati-
cally created databases comprising openly available lexical informa-
tion for Slovene. As the game is aimed at young(er) users, not only 
vulgar and offensive words were removed from the list of potential 
prompts, but also words with sensitive content that could cause the 
player unnecessary discomfort. The criteria for removal were based on 
existing resources, such as dictionaries, and privately compiled lists 
by researchers or journalists (ibid., p. 43). Semantically, the removed 
words covered a) human features, such as race, nationality, gender, 
age, sexual orientation, religious and political beliefs, migration status, 
social status, education, handicap, bodily and mental features etc., as 
87
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
well as b) sensitive topics, such as violence, illness, death, addiction, 
sex, excretions, etc. Offensive, vulgar, and potentially sensitive words 
for CrowLL were selected based on these categories, while non-prob-
lematic words were chosen from the most frequent words in English 
Web 2020, available on Sketch Engine.
The majority (50) of the included lemmata are nouns, 25 are verbs 
and 25 are adjectives. An example of seed lemmata with labels and 
translations is provided in Table 2.
Table 2: Common lemma list and its translations to Slovene, Estonian, Brazilian Portuguese 
and Dutch
Category Type English POS Slovene  Estonian Brazilian 
Portuguese
Dutch
Race B black-
skinned
A temnopolt mustanahaline negro zwart
Race B native N domorodec pärismaalane índio autochtoon
Race B racist A rasističen rassistlik racista racistisch
Race A nigger N črnuh neeger crioulo neger
sexual 
orientation
B homosexual A homoseksu-
alen
homosek-
suaalne
homossexual homosek-
sueel
sexual 
orientation
B straight A heterosek-
sualen
heterosek-
suaalne
heterossexual heterosek-
sueel
sexual 
orientation
B lesbian A lezbičen lesbiline lésbica lesbisch
sexual 
orientation
A faggot N peder pede bicha flikker
violence B to murder V umoriti mõrvama assassinar vermoorden
violence B brutal A brutalen brutaalne brutal brutaal
violence B to bully V ustrahovati kiusama intimidar intimideren
violence B to torture V mučiti piinama torturar martelen
violence B to rape V posiliti vägistama estuprar verkrachten
violence B to beat V pretepati peksma bater slaan
violence B to molest V zlorabljati ahistama molestar lastigvallen
violence B to shoot V ustreliti tulistama atirar schieten
non-
problematic
C time N čas aeg tempo tijd
non-
problematic
C way N način viis maneira manier
non-
problematic
C to include V vključiti sisaldama incluir omvatten
non-
problematic
C good A dober hea bom goed
88
Slovenščina 2.0, 2022 (2) | Articles
As mentioned at the outset of this section, the data extraction proce-
dure to obtain 10,000 sentences is meant for game development and 
initial tests. More specifically, the procedure will be performed as fol-
lows. We will use GDEX configurations to extract the top 200 sentences 
per lemma of the lemma list so that we have a buffer in case of dupli-
cates. We will then verify those 20,000 sentences and reduce them to 
10,000 sentences per language.
Once we have this data, we will proceed with manual annotation of 
the sentences with the labels from the game (non-problematic/prob-
lematic; category of the problem), which will allow us to evaluate the 
labeling system and the quality of the input data, and propose adjust-
ments to the resources and the game if necessary. These annotated 
sentences will comprise manually annotated pedagogical corpora, and 
will be available as part of the CLARIN Language Resources Family. They 
will also be fed into the game to be used for scoring mechanism devel-
opment, such as the scores given by comparison with other players and 
asynchronous play, for implementation of the dual-player mode, as pre-
recorded answers for a bot, and as input data for the game.
When the game is launched, additional data will be required as in-
put. The extraction of this data will follow a slightly different approach, 
given that we want the crowd to label as many sentences from the 
source corpora as possible. With the source corpora, pedagogically ori-
ented GDEX configurations, and tested labeling system and gameplay, 
data input for the game will be extracted as follows. First, we will GDEX 
the corpus, i.e., run the GDEX configuration to assign GDEX scores to 
all sentences in the corpus. We will then extract sentences in batches, 
with varying GDEX scores, i.e., a certain number of sentences with the 
highest scores, medium scores and low scores. These sentences will 
be input into the game for players to play. Once the game is tested with 
actual players, an evaluation of the methodology of data preparation 
can be carried out.
6 Analysis and discussion
One of the main aspects that might have an impact on the results of the 
initial test with annotation of 10,000 sentences is that the resources 
89
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
that were used for data preparation present different levels of develop-
ment. While Estonian and Slovene use source corpora that have been 
carefully compiled in the context of other projects, with rich metadata 
and advanced annotation, Dutch and Portuguese use automatically 
compiled web corpora with no human curation and POS-tagged by the 
Sketch Engine team. It should be acknowledged that these differences 
in the development of the resources might influence the quality of the 
input data (extracted sentences), with consequent reflection on the 
quality of the output data (annotated sentences).
Preparing the common lemma list posed many challenges, becom-
ing an iterative process in which English words were proposed, trans-
lated to the target languages and then – based on the suitability of the 
translation equivalents – accepted or replaced. A discussion was needed 
if for one or more target languages a translation equivalent was not suit-
able from the perspective of form, meaning, connotation or frequency.
To ease the data extraction, we aimed for a list of single-word lem-
mata for all target languages. We thus avoided English prompts that 
would require multiword translations. For example, for the English verb 
to fuck off not all languages had single-word translations (Slovene: 
odjebati, Estonian: perse käima, Portuguese: ir se foder, Dutch: opso-
demieteren), therefore we replaced it with the verb to fuck (Slovene: 
jebati, Estonian: keppima, Portuguese: foder, Dutch: neuken). More 
permissive were our decisions when it came to the part-of-speech of 
the translation equivalents. For most of the cases, providing transla-
tion equivalents of the same POS was unproblematic. In rare instances 
where the POS of otherwise the most suitable translation candidate 
did not match, we kept it on the list. For example, some English adjec-
tives in Estonian are actually case forms of a noun, e.g. depressioonis 
‘in depression’ (not ‘depressed’). When examining the occurrences of 
the lemmata in the source corpora, we also noticed that some POS dif-
ferences stemmed from the features of the taggers used to annotate 
the data (e.g., the Portuguese equivalent retardado for the English ad-
jective retarded occurs erroneously tagged as verbs (participle) in the 
Portuguese corpus). While such problems would have to be considered 
when extracting the data, they did not influence the selection of the 
candidates for the common lemma list.
90
Slovenščina 2.0, 2022 (2) | Articles
Important for the list was the connotation of the translation equiv-
alents. When the target language did not have a translation equivalent 
with comparable sensitivity, the English word was replaced. For exam-
ple, the English noun bimbo for an ‘attractive but unintelligent or frivo-
lous young woman’ did not have a suitable single-word translation in 
Portuguese, so we replaced it with a (more offensive) slut (Slovene: 
cipa, Estonian: libu, Portuguese: vagabunda, Dutch: slet). Other se-
mantic differences, such as nuances in the meaning(s) of the translat-
ed words were accepted, as we did not want to create a list that would 
be overly curated, artificial, and methodologically difficult to expand 
with further lemmata and to other languages. In situations where more 
semantically suitable translation equivalents were possible, we opted 
for the one that was less polysemic (for example, for the English noun 
corpse, we chose the Portuguese cadáver and not corpo which has a 
wider use).
Finally, the translation equivalents were checked for their frequen-
cy in the corresponding source corpora. According to our methodology, 
we needed at least 100 heterogenous corpus examples per lemma, 
but to have enough data to select from we aimed to extract 200. Espe-
cially in “cleaner” corpora, such as the Slovene source corpus Gigafi-
da, the offensive and vulgar words were rare, but nearly all proposed 
lemmas had over 200 occurrences. We decided to keep the noun as-
shole with a Slovene translation pezde (198 occurrences in the Slovene 
source corpus) and replace the adjective transsexual (less than 10 oc-
currences in the Dutch source corpus) with a more frequently occurring 
transgender. 
Once the game is fully operational, a series of issues need to be 
considered. For example, it is important to ensure the rapid implemen-
tation of the game’s results into practice. This requires both a set of 
clear parameters on what a minimum number – as well as a maximum 
number – of user responses per example is, what level of agreement 
is required, etc., as well as automatic tools or algorithms for regular 
data analysis and summarization. All this helps to increase the quantity 
of crowdsourced data, as more examples can be added to the game 
(and at the same time the sufficiently examined ones removed) on a 
regular basis. Technical aspects should also be paid enough attention, 
91
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
meaning the server should have enough capacity and storage space 
to cater for heavy usage, which can partly be addressed by conducting 
rigorous stress tests before the launch of the game. Last but not least, 
a detailed promotion plan needs to be prepared in advance, including 
the steps on how to not only attract users, but also keep them long 
term.
7 Conclusions
In this paper, we proposed a methodology of data preparation for the 
development of the Crowdsourcing for Language Learning (CrowLL) 
game, from which data will be collected through crowdsourcing to cre-
ate problem-labeled pedagogical corpora for Dutch, Estonian, Slovene, 
and Brazilian Portuguese. For this process a series of decisions had to 
be made, from the choice of source corpora, to GDEX configuration de-
velopment and lemma list creation. By describing the methodology and 
reflecting on the challenges posed and solutions found, it is our intention 
to provide researchers sharing common interests with a model that can 
be applied to other languages, and potentially to other purposes. 
The next steps of our project involve the extraction of sentences 
for the game, full implementation of the game, collection of answers 
(from actual players), statistical analysis of labeled data, and design 
and administration of a user survey to evaluate the game design and 
user experience. With the players’ answers, we will compile problem-
annotated corpora and develop other auxiliary language learning re-
sources, such as SKELL for all the languages. After that, we plan to start 
the third stage of the project, in which we will use the problem-labeled 
corpora to create the basis for the future development of machine-
learning training models to automatize identification and labeling of 
problematic content, thus contributing to the further and faster crea-
tion of pedagogical corpora.
Acknowledgments 
The authors acknowledge the financial support received from the Portu-
guese national funding agency, FCT – Foundation for Science and Technology, 
I.P. (grant number UIDP/04887/2020) and the Slovenian Research Agency 
92
Slovenščina 2.0, 2022 (2) | Articles
(research core funding No. P6-0411, Language Resources and Technologies 
for Slovene, and project funding No. J7-3159, Empirical foundations for digi-
tally-supported development of writing skills). The research received funding 
from the European Union’s Horizon 2020 research and innovation program 
under grant agreement No. 731015. This study has also been supported by 
the CLARIN Resource Families Project Funding.
References
Aitamurto, T., Leiponen, A., & Tee, R. (2011). The promise of idea crowdsourc-
ing–benefits, contexts, limitations [White paper]. Nokia Ideas project.
Arhar Holdt, Š., Kosem, I., & Gantar, P. (2017). Corpus-based resourc-
es for L1 teaching: The case of Slovene. In Handbook on digi-
tal learning for K-12 schools (pp. 91–113). Springer, Cham. doi: 
10.1007/978-3-319-33808-8_7 
Arhar Holdt, Š., Kosem, I., Krapš Vodopivec, I., Ledinek, N., Može, S., Stritar 
Kučuk, M., Svenšek, T., & Zwitter Vitez, A. (2011). Pedagoška slovnica pri 
projektu Sporazumevanje v slovenskem jeziku: K16 – Standard za korpus-
no analizo slovničnih pojavov. Ljubljana: Ministrstvo za šolstvo in šport: 
Amebis. Retrieved from http://projekt.slovenscina.eu/Media/Kazalniki/
Kazalnik16/Kazalnik_16_Pedagoska_slovnica_SSJ.pdf 
Arhar Holdt, Š., Logar, N., Pori, E., & Kosem, I. (2021). “Game of Words”: 
Play the game, clean the database. In Z. Gavriilidou, M. Mitsiaki & A. 
Fliatouras (Eds.), Proceedings of the EURALEX XIX congress: Lexicog-
raphy for inclusion, 7–11 September, Aleksandroupolis, Greece (Vol I., 
pp. 41–49). Retrieved from https://www.euralex.org/elx_proceedings/
Euralex2020-2021/EURALEX2020-2021_Vol1-p041-049.pdf
Baisa, V., & Suchomel, V. (2014). SkELL: Web interface for English language 
learning. Proceedings of the eighth workshop on recent advances in Sla-
vonic natural language processing, RASLAN 2014 (pp. 63–70). Retrieved 
from https://nlp.fi.muni.cz/raslan/2014/12.pdf
Bassignana, E., Basile, V., & Patti, V. (2018). Hurtlex: A multilingual lexicon of 
words to hurt. CEUR Workshop proceedings, 1–6. Retrieved from http://
ceur-ws.org/Vol-2253/paper49.pdf 
Bédi, B., Chua, C., Habibi, H., Martinez-Lopez, R., & Rayner, M. (2019). Using 
LARA for language learning: a pilot study for Icelandic. In F. Meunier, J. 
van de Vyver, L. Bradley & S. Thouësny (Eds.), CALL and complexity: short 
papers from EUROCALL 2019 (pp. 33–38). Research-publishing.net. doi: 
10.14705/rpnet.2019.38.982
93
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Bonetti, F., & Tonelli. S. (2020). A 3D role-playing game for abusive language 
annotation. Workshop on games and natural language processing (pp. 
39–43). Retrieved from https://aclanthology.org/2020.gamnlp-1.6 
Bonetti, F., & Tonelli. S. (2021). Challenges in designing games with a pur-
pose for abusive language annotation. Proceedings of the first workshop 
on bridging human–computer interaction and natural language process-
ing (pp. 60–65). https://aclanthology.org/2021.hcinlp-1.10 
Boulton, A. (2017). Corpora in language teaching and learning: Re-
search timeline. Language Teaching, 50(4), 483–506. doi: 10.1017/
S0261444817000167
Braun, S. (2005). From pedagogically relevant corpora to authentic lan-
guage learning contents. ReCALL, 17(1), 47–64. doi: 10.1017/
S0958344005000510 
Buecheler, T., Sieg, J. H., Füchslin, R. M., & Pfeifer, R. (2010). Crowdsourc-
ing, open innovation and collective intelligence in the scientific method: 
a research agenda and operational framework. In H. Fellermann, M. Dörr, 
M. Hanczyc, L. L. Laursen, S. Maurer, D. Merkle, P-A. Monnard, K. Stoy, S. 
Rasmussen (Eds.), Artificial live XII: proceedings of the twelfth interna-
tional conference on the synthesis and simulation of living systems (pp. 
679–686). MIT Press. doi: 10.21256/zhaw-4094
Callies, M. (2019). Integrating corpus literacy into language teacher educa-
tion. In S. Götz, J. Mukherjee (Eds.), Learner corpora and language teach-
ing (pp. 245–263). John Benjamins Publishing Company. doi: 10.1075/
scl.92.12cal
Chamberlain, J., Poesio, M., & Kruschwitz, U. (2008). Phrase detectives: A 
web-based collaborative annotation game. Proceedings of the interna-
tional conference on semantic systems (I-Semantics’08) (pp. 42–49). Re-
trieved from https://www.jonchamberlain.com/media/doc/Chamberlain-
2008Phrase.pdf 
Chambers, A. (2016). Written language corpora and pedagogic applications. In 
F. Farr, L. Murray (Eds.), The Routledge handbook of language learning and 
technology (pp. 362–375). Routledge. doi: 10.4324/9781315657899.ch26 
Chesbrough, H. W. (2006). Open innovation: The new imperative for creating 
and profiting from technology. Harvard Business School Press.
Colman, L., & Tiberius C. (2018). A good match: A Dutch collocation, idiom 
and pattern dictionary combined. Proceedings of the XVIII EURALEX in-
ternational congress: Lexicography in global contexts (pp. 233–246). Re-
trieved from https://euralex.org/wp-content/themes/euralex/proceed-
ings/Euralex%202018/118-4-2952-1-10-20180820.pdf 
94
Slovenščina 2.0, 2022 (2) | Articles
Erelt, M., & Metslang, H. (2017). Eesti keele süntaks. Eesti keele vara-
mu III. Tartu Ülikooli Kirjastus. Retrieved from https://dspace.ut.ee/
handle/10062/70510
Eryiğit, G., Şentaş, A., & Monti, J. (2022). Gamified crowdsourcing for idiom 
corpora construction. Natural Language Engineering (pp. 1–33). doi: 
10.1017/S1351324921000401 
Gantar, P., Kosem, I., & Krek, S. (2016). Discovering automated lexicography: 
The case of the Slovene lexical database. International Journal of Lexi-
cography, 29(2), 200–225. doi: 10.1093/ijl/ecw014
Gorovaia, N. (2018). Behavior of users on the crowdsourcing platforms. [Post-
er session]. EnetCollect WG3/WG5 meeting, October 24–25, Leiden, 
Netherlands. 
Gries, S. (2009). What is corpus linguistics? Language and Linguistics Com-
pass, 3, 1–17. doi: 10.1111/j.1749-818X.2009.00149.x 
Guillaume, B., Fort, K., & Lefebvre, N. (2016). Crowdsourcing complex lan-
guage resources: Playing to annotate dependency syntax. Proceedings 
of COLING 2016, the 26th international conference on computational lin-
guistics: Technical papers (pp. 3041–3052). Retrieved from https://aclan-
thology.org/C16-1286 
Hacker, S., & von Ahn, L. (2009). Matchin: eliciting user preferences with an 
online game. Proceedings of the SIGCHI Conference on Human Factors in 
Computing Systems (pp. 1207–1216). doi: 10.1145/1518701.1518882 
Harris, C.G. (2014). The beauty contest revisited: measuring consensus rank-
ings of relevance using a game. Proceedings of the first international work-
shop on gamification for information retrieval – GamifIR@ECIR ‘14 (pp. 
17–21). doi: 10.1145/2594776.2594780
Kallas, J., Kilgarriff, A., Koppel, K., Kudritski, E., Langemets, M., Michelfeit, 
J., Tuulik, M., & Viks, Ü. (2015). Automatic generation of the Estonian 
Collocations Dictionary database. Proceedings of the eLex 2015 confer-
ence (pp. 1−20). Retrieved from https://elex.link/elex2015/proceedings/
eLex_2015_01_Kallas+etal.pdf
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, 
P., & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 
1(1), 7–36. doi: 10.1007/s40607-014-0009-9 
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., & Rychlý, P. (2008). GDEX: 
Automatically finding good dictionary examples in a corpus. Proceedings 
of the XIII EURALEX international congress (Vol. 1, pp. 425–432). https://
tinyurl.com/yckr9w8s
95
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Kilgarriff, A., Rychlý, P., Smrz, P., & D. Tugwell (2004). The Sketch Engine. Pro-
ceedings of the eleventh EURALEX international congress, EURALEX 2004 
(pp. 105–116). Retrieved from https://tinyurl.com/mvrp4ymy 
Koppel, K. (2019). Leksikograafide ja keeleõppijate hinnangud automaat-
selt tuvastatud korpuslausete sobivusele õppesõnastiku näitelauseks. 
Lähivõrdlusi. Lähivertailuja, 29, 84−112. doi: 10.5128/LV29.03 
Koppel, K. (2020). Näitelausete korpuspõhine automaattuvastus eesti keele 
õppesõnastikele. Doktoritöö, Tartu Ülikool. Retrieved from https://dspace.
ut.ee/handle/10062/67138 
Koppel, K., & Kallas, J. (2022). Eesti keele ühendkorpus 2021. doi: 
10.15155/3-00-0000-0000-0000-08D17L 
Koppel, K., Kallas, J., Khokhlova, M., Suchomel, V., Baisa, V., & Michelfeit, J. 
(2019). SkELL corpora as a part of the language portal Sõnaveeb: problems 
and perspectives. Proceedings of the eLex 2019 conference (pp. 763−782). 
Retrieved from https://zenodo.org/record/3612933#.Yywd1XZBy70
Koppel, K., Tavast, A., Langemets, M., & Kallas, J. (2019). Aggregating dictionar-
ies into the language portal Sõnaveeb: issues with and without a solution. 
Proceedings of the eLex 2019 conference (pp. 434−452). Retrieved from htt-
ps://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_24.pdf
Kosem, I. (2012,). Using GDEX in (semi)-automatic creation of database en-
tries [Conference presentation]. SKEW-3, 3rd international Sketch Engine 
workshop, 21−22 March, 2012. 
Kosem, I., Gantar, P., & Krek, S. (2013). Automation of lexicographic work: 
an opportunity for both lexicographers and crowd-sourcing. Proceedings 
of the eLex 2013 conference (pp. 32−48). Retrieved from http://eki.ee/
elex2013/proceedings/eLex2013_03_Kosem+Gantar+Krek.pdf 
Kosem, I., Husák, M., & McCarthy, D. (2011). GDEX for Slovene. Proceedings 
of eLex 2011 (pp. 151–159). Retrieved from http://www.dianamccarthy.
co.uk/files/Kosemetal-paper.pdf 
Kosem, I., Koppel, K., Kuhn, T. Z., Michelfeit, J., & Tiberius, C. (2019). Identifi-
cation and automatic extraction of good dictionary examples: the case(s) 
of GDEX. International Journal of Lexicography, 32(2), 119−137. doi: 
10.1093/ijl/ecy014 
Krek, S., Arhar Holdt, Š., Erjavec, T., Čibej, J., Repar, A., Gantar, P., Ljubešić, N., 
Kosem, I., & Dobrovoljc, K. (2020). Gigafida 2.0: The reference corpus of 
written standard Slovene. Proceedings of the twelfth language resources 
and evaluation conference (pp. 3340–3345). Retrieved from https://ac-
lanthology.org/2020.lrec-1.409 
96
Slovenščina 2.0, 2022 (2) | Articles
Kuhn, T. Z. (2017). A design proposal of an online corpus-driven dictionary of 
Portuguese for university students [Doctoral dissertation, Universidade de 
Lisboa]. Retrieved from http://hdl.handle.net/10451/32013
Kuhn, T. Z., Šandrih Todorović, B., Holdt, Š. A., Zviel-Girshin, R., Koppel, K., Luís, 
A.R., & Kosem, I. (2021). Crowdsourcing pedagogical corpora for lexico-
graphical purposes. Proceedings of the XIX EURALEX congress: Lexicog-
raphy for inclusion (Vol. II., pp. 771–779). Retrieved from https://www.
euralex.org/elx_proceedings/Euralex2020-2021/EURALEX2020-2021_
Vol2-p771-779.pdf 
Lafourcade, M. (2007). Making people play for Lexical Acquisition with the 
JeuxDeMots prototype. Proceedings of SNLP’07: 7th international sym-
posium on natural language processing. Retrieved from https://hal-lirmm.
ccsd.cnrs.fr/lirmm-00200883 
Langemets, M., Hein, I., Jürviste, M., Kallas, J., Kiisla, O., Koppel, K., 
Leemets, T., …, & Tubin, V. (2022). EKI ühendsõnastik 2022. doi: 
10.15155/3-00-0000-0000-0000-08C0AL 
Lévy, P. (1997). Collective intelligence: Mankind’s emerging world in cyber-
space. Plenum Trade. New York.
Lew, R. (2014). User-generated content (UGC) in online English dictionaries. 
OPAL, 4, 8–26. Retrieved from https://pub.ids-mannheim.de//laufend/
opal/opal14-4.html 
Lyding, V., Nicolas, L., Bédi, B., & Fort, K. (2018). Introducing the European 
network for combining language learning and crowdsourcing techniques 
(enetcollect). In P. Taalas, J. Jalkanen, L. Bradley & S. Thouësny (Eds.), 
Future-proof CALL: language learning as exploration and encounters–
short papers from EUROCALL (pp. 176–181). Research-publishing.net. 
doi: 10.14705/rpnet.2018.26.833 
Morschheuser, B., Hamari, J., Koivisto, J., & Maedche, A. (2017). Gamified 
crowdsourcing: Conceptualization, literature review, and future agenda. 
International Journal of Human-Computer Studies, 106, 26–43. doi: 
10.1016/j.ijhcs.2017.04.005 
Nicolas, L., Lyding, V., Borg, C., Forăscu, C., Fort, K., Zdravkova, K., Kosem, 
I., …, & HaCohen-Kerner, Y. (2020). Creating expert knowledge by relying 
on language learners: a generic approach for mass-producing language 
resources by combining implicit crowdsourcing and language learning. 
Proceedings of the 12th language resources and evaluation conference 
(pp. 268–278). Retrieved from https://aclanthology.org/2020.lrec-1.34 
Osborne, J. (2004). Top-down and bottom-up approaches to corpora in lan-
guage teaching. language and computers. In U. Connor, T. A. Upton (Eds.), 
97
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Applied Corpus Linguistics. A Multidimensional Perspective (pp. 251–
265). Brill. doi: 10.1163/9789004333772_015 
Pe-Than, E. P. P., Goh, D. H. L., & Lee, C. S. (2015). A typology of hu-
man computation games: an analysis and a review of current 
games. Behaviour & Information Technology, 34(8), 809–824. doi: 
10.1080/0144929X.2013.862304 
Pilán, I., Vajjala, S., & Volodina, E. (2016). A readable read: Automatic assess-
ment of language learning materials based on linguistic complexity. ArXiv. 
doi: 10.48550/arXiv.1603.08868 
Pilán, I., Volodina, E., & Johansson, R. (2013). Automatic selection of suit-
able sentences for language learning exercises. 20 Years of EUROCALL: 
Learning from the past, looking to the future: 2013 EUROCALL Conference 
Proceedings (pp. 218–225). Retrieved from https://aclanthology.org/
W14-1821.pdf 
Pilán, I., Volodina, E., & Johansson, R. (2014). Rule-based and machine learn-
ing approaches for second language sentence-level readability. Proceed-
ings of the ninth workshop on innovative use of NLP for building educa-
tional applications (pp. 174–184). Retrieved from https://aclanthology.
org/W14-1821 
Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., & Patti, V. (2021). Resourc-
es and benchmark corpora for hate speech detection: a systematic re-
view. Language Resources & Evaluation, 55(2), 477–523. doi: 10.1007/
s10579-020-09502-8 
Prahalad, C. K., & Ramaswamy, V. (2000). Co-opting customer competence. 
Harvard Business Review. Retrieved from https://hbr.org/2000/01/
co-opting-customer-competence 
Preist, C., Massung, E., & Coyle, D. (2014). Competing or aiming to be av-
erage? Normification as a means of engaging digital volunteers. Pro-
ceedings of the 17th ACM conference on computer supported coop-
erative work & social computing (CSCW ‘14) (pp. 1222–1233). doi: 
10.1145/2531602.2531615 
Reynaert, M. (2006). Corpus-induced corpus clean-up. Proceedings of the 
fifth international conference on language resources and evaluation (pp. 
87–92). Retrieved from http://www.lrec-conf.org/proceedings/lrec2006/
pdf/229_pdf.pdf 
Römer, U. (2009). Using general and specialised corpora in language teach-
ing: Past, present and future. In M. C. Campoy, B. Belles-Fortuno & M. L. 
Gea-Valor (Eds.), Corpus-based approaches to English language teaching 
(pp.18–35). Continuum Publishing Corporation. 
98
Slovenščina 2.0, 2022 (2) | Articles
Šandrih Todorović, B. (2020). Impact of text classification on natural language 
processing applications. [Универзитет у Београду].
Schmidt, A., & Wiegand, M. (2017). A survey on hate speech detection using 
natural language processing. Proceedings of the fifth international work-
shop on natural language processing for social media (pp. 1–10). doi: 
10.18653/v1/W17-1101
Seemakurty, N., Chu, J., von Ahn, L., & Tomasic, A. (2010). Word sense disam-
biguation via human computation. Proceedings of the ACM SIGKDD work-
shop on human computation (pp. 60–63). doi: 10.1145/1837885.1837905 
Simpson, R., Page, K. R., & De Roure, D. (2014). Zooniverse: observ-
ing the world’s largest citizen science platform. Proceedings of the 
23rd international conference on world wide web, 1049–1054. doi: 
10.1145/2567948.2579215 
Sinclair, J. (2005). Corpus and text - basic principles. In M. Wynne (Ed.), De-
veloping linguistic corpora: A guide to good practice (pp. 1–16). Oxbow 
Books. Retrieved from https://users.ox.ac.uk/~martinw/dlc/chapter1.htm 
Stanković, R., Šandrih, B., Stijović, R., Krstev, C., Vitas, D., & Marković, A. (2019). 
SASA dictionary as the gold standard for good dictionary examples for 
Serbian. Proceedings of the eLex 2019 conference (pp. 248–269). Re-
trieved from https://elex.link/elex2019/wp-content/uploads/2019/09/
eLex_2019_14.pdf 
Trampuš, M., & Novak, B. (2012). The internals of an aggregated web news 
feed. Proceedings of 15th multiconference on information society 2012 
(IS-2012). Retrieved from http://ailab.ijs.si/dunja/SiKDD2012/Papers/
Trampus_Newsfeed.pdf
Vajjala, S. (2022). Trends, limitations and open challenges in automatic read-
ability assessment research. Proceedings of the thirteenth language re-
sources and evaluation conference (pp. 5366–5377). Retrieved from 
https://aclanthology.org/2022.lrec-1.574 
Vidgen, B., & Derczynski, L. (2020). Directions in abusive language training 
data, a systematic review: Garbage in, garbage out. PLoS ONE, 15(12): 
e0243300. doi: 10.1371/journal.pone.0243300 
von Ahn, L. (2006). Games with a purpose. Computer, 39(6), 92–94. Retrieved 
from https://www.cs.cmu.edu/~biglou/ieee-gwap.pdf 
von Ahn, L., & Dabbish, L. (2008). Designing games with a purpose. Communi-
cations of the ACM, 51(8), 58–67. doi: 10.1145/1378704.1378719 
Von Hippel, E., & Katz, R. (2002). Shifting innovation to users via toolkits. Man-
agement science, 48(7), 821–833.
99
Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game
Vyatkina, N., & Boulton, A. (2017). Corpora in language teaching and learning. 
Language Learning and Technology, 21(3), 1–8.
Xu, L., & Chamberlain, J. (2020). Cipher: a prototype game-with-a-purpose for 
detecting errors in text. Workshop games and natural language processing 
(pp. 17–25). Retrieved from https://aclanthology.org/2020.gamnlp-1.3 
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. 
(2019). SemEval-2019 task 6: Identifying and categorizing offensive lan-
guage in social media (OffensEval). Proceedings of the 13th internation-
al workshop on semantic evaluation (SemEval-2019) (pp. 75–86). doi: 
10.18653/v1/S19-2010
Zampieri, M., Nakov, P., Rosenthal, S., Atanasova, P., Karadzhov, G., Mubarak, 
H., Derczynski, L., Pitenis, Z., & Çöltekin, C. (2020). SemEval-2020 task 
12: Multilingual offensive language identification in social media (Offen-
sEval 2020). Proceedings of the 14th international workshop on semantic 
evaluation. Retrieved from https://arxiv.org/abs/2006.07235 
Zviel-Girshin, R., Kuhn, T. Z., Luís, A. R., Koppel, K., Šandrih Todorović, B., 
Holdt, Š. A., Tiberius, C., & Kosem, I. (2021). Developing pedagogically 
appropriate language corpora through crowdsourcing and gamification. 
In N. Zoghlami, C. Brudermann, C. Sarré, M. Grosbois, L. Bradley, & S. 
Thouësny (Eds), CALL and professionalisation: short papers from EURO-
CALL 2021 (pp. 312–317). doi: 10.14705/rpnet.2021.54.1352
Priprava podatkov pri množičenju v pedagoške namene:  
primer igre CrowLL
Eden od načinov za spodbujanje uporabe korpusov pri jezikovnem izobraže-
vanju je izdelava pedagoško primernih korpusov, označenih z različnimi vr-
stami problematik (občutljiva vsebina, žaljiv jezik, strukturne težave). Ker je 
ročno označevanje korpusov zelo časovno potratno, je potrebno poiskati boljši 
pristop. Predlagamo kombinacijo dveh pristopov k oblikovanju problemsko 
označenih pedagoških korpusov nizozemščine, estonščine, slovenščine in 
brazilske portugalščine: uporabo iger z namenom množičenja. Z udeleženci 
smo izvedli začetne poskuse, da bi ugotovili, če je naloga množičenja ustre-
zna, pridobljene izkušnje pa smo uporabili za oblikovanje igre Crowdsourcing 
for Language Learning (CrowLL), v kateri igralci prepoznavajo problematične 
povedi in segmente ter jih razvrščajo. V prispevku se osredotočamo na pripra-
vo podatkov, saj ima ta korak ključni pomen pri vsakem projektu množičenja, 
ki obravnava ustvarjanje jezikovnih učnih virov. Predlagamo metodologijo za 
100
Slovenščina 2.0, 2022 (2) | Articles
pripravo podatkov, podrobno predstavljamo izbiro izvornih korpusov, pedago-
ško usmerjene konfiguracije GDEX in oblikovanje seznamov lem, s posebnim 
poudarkom na pogostih in od jezika odvisnih odločitvah. Za konec ponujamo 
razpravo o izzivih, ki smo jih zasledili, in o rešitvah, ki smo jih do sedaj že uvedli.
Ključne besede: množičenje, igra z namenom, vzorčni stavki, pedagoški 
korpus