M. ROBNIK-ŠIKONJA, K. REBA, I. MOZETIČ: Cross-lingual transfer of sentiment classifiers
CROSS-LINGUAL TRANSFER OF  
SENTIMENT CLASSIFIERS
Marko ROBNIK-ŠIKONJA
Faculty of Computer and Information Science, University of Ljubljana
Kristjan REBA
Faculty of Computer and Information Science, University of Ljubljana
Igor MOZETIČ
Jožef Stefan Institute
Robnik-Šikonja, M., Reba, K., Mozetič, I. (2021): Cross-lingual transfer of sentiment 
classifiers. Slovenščina 2.0, 9(1): 1–25. 
DOI: https://doi.org/10.4312/slo2.0.2021.1.1-25
Word embeddings represent words in a numeric space so that semantic relations 
between words are represented as distances and directions in the vector space. 
Cross-lingual word embeddings transform vector spaces of different languages so 
that similar words are aligned. This is done by mapping one language’s vector space 
to the vector space of another language or by construction of a joint vector space 
for multiple languages. Cross-lingual embeddings can be used to transfer machine 
learning models between languages, thereby compensating for insufficient data 
in less-resourced languages. We use cross-lingual word embeddings to transfer 
machine learning prediction models for Twitter sentiment between 13 languages. 
We focus on two transfer mechanisms that have recently shown superior transfer
performance. The first mechanism uses trained models whose input is the joint
numerical space for many languages, as implemented in the LASER library. The second
mechanism uses large pretrained multilingual BERT language models. Our experiments
show that the transfer of models between similar languages is sensible, even
with no target language data. The performance of cross-lingual models obtained 
with the multilingual BERT and LASER library is comparable, and the differences 
are language-dependent. The transfer with CroSloEngual BERT, pretrained on only 
three languages, is superior on these and some closely related languages.
Keywords: natural language processing, machine learning, text embeddings, sentiment
analysis, BERT models
Slovenščina 2.0, 2021 (1)
1 INTRODUCTION
Word embeddings are representations of words in numerical form, as vectors 
of typically several hundred dimensions. The vectors are used as input to machine
learning models; for complex language processing tasks, these are generally
deep neural networks. The embedding vectors are obtained from specialised
neural network-based embedding algorithms, e.g., fastText (Bojanowski
et al., 2017) for morphologically rich languages. Word embedding spaces exhibit
similar structures across languages, even when considering distant language
pairs like English and Vietnamese (Mikolov et al., 2013). This means
that embeddings independently produced from monolingual text resources 
can be aligned, resulting in a common cross-lingual representation, called 
cross-lingual embeddings, which allows for fast and effective integration of 
information in different languages.
There exist several approaches to cross-lingual embeddings. The first group 
of approaches uses monolingual embeddings with an optional help from a 
bilingual dictionary to align the pairs of embeddings (Artetxe et al., 2018a). 
The second group of approaches uses bilingually aligned (comparable or even 
parallel) corpora to construct joint embeddings (Artetxe and Schwenk, 2019). 
This approach is implemented in the LASER library¹ and is available for 93
languages. The third group of approaches is based on large pretrained multilingual
masked language models such as BERT (Devlin et al., 2019). In this work,
we focus on the second and third groups of approaches. In particular, from the
third group, we apply two variants of BERT models: the original multilingual
BERT model (mBERT), trained on 104 languages, and the trilingual CroSloEngual
BERT (CSE BERT; Ulčar and Robnik-Šikonja, 2020), trained on Croatian, Slovene,
and English.
Sentiment annotation is a costly and lengthy operation, with a relatively low 
inter-annotator agreement (Mozetič et al., 2016). Large annotated sentiment 
datasets are, therefore, rare, especially for low-resourced languages. The 
transfer of already trained models or datasets from other languages would 
increase the ability to study sentiment-related phenomena for many more languages
than possible today.
1 https://github.com/facebookresearch/LASER 
Our study aims to analyse the abilities of modern cross-lingual approaches for
the transfer of trained models between languages. We study two cross-lingual
transfer technologies: a joint vector space computed from parallel corpora
with the LASER library, and multilingual BERT models. An advantage
of our study is the use of sizeable, comparable classification datasets in 13 different
languages, which gives credibility and general validity to our findings. Further,
due to the datasets’ size, we can reliably test different transfer modes: direct
transfer between languages (called zero-shot transfer) and transfer with
enough fine-tuning data in the target language. In the experiments, we study
two cross-lingual transfer modes based on projections of sentences into a joint
vector space. The first mode transfers trained models from source to target
languages. A model is trained on the source language(s) and used for classification
in the target language(s). This model transfer is possible because texts
in all processed languages are embedded into the common vector space. The
second mode expands the training set with instances from other languages,
and then all instances are mapped into the common vector space during neural
network training. Besides the cross-lingual transfer, we analyse the quality
of representations for Twitter sentiment classification and compare the
common vector space for several languages constructed by the LASER library,
multilingual BERT models, and the traditional bag-of-words approach.
The results show a relatively low decrease in predictive performance when
transferring trained sentiment prediction models between similar languages
and superior performance of multilingual BERT models covering only three
languages.
The paper is divided into four more sections. In Section 2, we present background
on different types of cross-lingual embeddings: alignment of monolingual
embeddings, building a common explicit vector space for several languages,
and large pretrained multilingual contextual models. We also discuss
related work on Twitter sentiment analysis and cross-lingual transfer of classification
models. In Section 3, we present a large collection of tweets from
13 languages used in our empirical evaluation, the implementation details
of our deep neural network prediction models, and the evaluation metrics
used. Section 4 contains four series of experiments. We first evaluate different
representation spaces and compare the LASER common vector space with
multilingual BERT models and conventional bag-of-ngrams. We then analyse
the transfer of trained models between languages from the same language
group and from a different language group, followed by expanding datasets
with instances from other languages. In Section 5, we summarise the results
and present ideas for further work.
2 BACKGROUND AND RELATED WORK
Word embeddings represent each word in a language as a vector in a high-dimensional
vector space so that the relations between words in a language
are reflected in their corresponding embeddings. Cross-lingual embeddings
attempt to map words represented as vectors from one vector space to another
so that the vectors representing words with the same meaning in both
languages are as close as possible. Søgaard et al. (2019) present a detailed
overview and classification of cross-lingual methods.
Cross-lingual approaches can be sorted into three groups, described in the
following three subsections. The first group of methods uses monolingual
embeddings with (optional) help from bilingual dictionaries to align the
embeddings. The second group of approaches uses bilingually aligned (comparable
or even parallel) corpora for joint construction of embeddings in all
handled languages. The third group of approaches is based on large pretrained
multilingual masked language models such as BERT (Devlin et al., 2019). In
contrast to the first two groups of approaches, the multilingual BERT models
are typically used as starting models, which are fine-tuned for a particular task
without explicitly extracting embedding vectors.
In Section 2.1, we first present background information on the alignment of 
individual monolingual embeddings. We describe the projections of many 
languages into a joint vector space in Section 2.2, and in Section 2.3, we present
variants of multilingual BERT models. In Section 2.4, we describe related
work on Twitter sentiment classification. Finally, in Section 2.5, we outline the 
related work on cross-lingual transfer of classification models. 
2.1 Alignment of monolingual embeddings
Cross-lingual alignment methods take precomputed word embeddings for
each language and align them with the optional use of bilingual dictionaries.
Two types of monolingual embedding alignment methods exist. The first
type of approach maps vectors representing words in one of the languages
into the vector space of the other language (and vice versa). The second type
of approach maps embeddings from both languages into a joint vector
space. The goal of both types of alignment is the same: the embeddings for
words with the same meaning must be as close as possible in the final vector
space. A comprehensive summary of existing approaches can be found in
Artetxe et al. (2018a). The open-source vecmap² library contains implementations
of the methods described in Artetxe et al. (2018a) and can align
monolingual embeddings using a supervised, semi-supervised, or unsupervised
approach.
The supervised approach requires the use of a bilingual dictionary, which is
used to match embeddings of equivalent words. The embeddings are aligned
using the Moore-Penrose pseudo-inverse, which minimises the sum of squared
Euclidean distances. The algorithm always converges but can be caught in a
local optimum. Several methods (e.g., stochastic dictionary induction
or frequency-based vocabulary cut-off) are used to help the algorithm escape
local optima. A more detailed description of the algorithm is given in
Artetxe et al. (2018b).
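The pseudo-inverse step can be illustrated with a minimal NumPy sketch. It is not the vecmap implementation: the function name `fit_linear_map` and the toy data are hypothetical, and the sketch only shows that the pseudo-inverse yields the linear map with the smallest sum of squared Euclidean distances between mapped source embeddings and their dictionary translations.

```python
import numpy as np

def fit_linear_map(X_src, Y_tgt):
    """Return W minimising the sum of squared distances ||X_src @ W - Y_tgt||^2.

    Row i of X_src and row i of Y_tgt are the embeddings of one dictionary pair.
    """
    return np.linalg.pinv(X_src) @ Y_tgt

# Toy data (hypothetical): the target space is an exact linear image of the source space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # 50 dictionary entries, 4-dimensional source embeddings
W_true = rng.normal(size=(4, 4))    # unknown transformation to be recovered
Y = X @ W_true                      # embeddings of the translations

W = fit_linear_map(X, Y)
mapped = X @ W                      # source words projected into the target space
```

On this noise-free toy data the recovered map coincides with `W_true`; with real embeddings the fit is only approximate, which is why the refinements mentioned above matter.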
The semi-supervised approach uses a small initial seeding dictionary, while
the unsupervised approach is run without any bilingual information. The latter
uses similarity matrices of both embeddings to build an initial dictionary.
This initial dictionary is usually of low but sufficient quality for later processing.
After the initial dictionary (either by seeding dictionary or using similarity
matrices) is built, an iterative algorithm is applied. The algorithm first
computes optimal mapping using the pseudo-inverse approach for the given
initial dictionary. The optimal dictionary for the given embeddings is then
computed, and the procedure iterates with the new dictionary.
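The iteration described above can be sketched as follows. This is a simplified illustration under our own assumptions, not vecmap's algorithm: the dictionary is induced by plain cosine nearest neighbours, the number of iterations is fixed, and the names `self_learning` and `induce_dictionary` as well as the toy data are hypothetical.

```python
import numpy as np

def normalise(M):
    """Scale each row to unit length so that dot products are cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def induce_dictionary(X, Y, W):
    """For each source word, return the index of the nearest target word after mapping."""
    sims = normalise(X @ W) @ normalise(Y).T
    return sims.argmax(axis=1)

def self_learning(X, Y, seed_src, seed_tgt, n_iter=5):
    """Alternate between fitting the map and re-inducing the dictionary."""
    src, tgt = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(n_iter):
        W = np.linalg.pinv(X[src]) @ Y[tgt]   # mapping step on the current dictionary
        tgt = induce_dictionary(X, Y, W)      # dictionary step over the whole vocabulary
        src = np.arange(X.shape[0])
    return W, tgt

# Toy data (hypothetical): target embeddings are a rotated, shuffled copy of the source.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Q = np.linalg.qr(rng.normal(size=(5, 5)))[0]  # random orthogonal matrix
perm = rng.permutation(30)                    # source word i translates to target word perm[i]
Y = np.empty_like(X)
Y[perm] = X @ Q

# Seed dictionary with the first eight word pairs only.
W, induced = self_learning(X, Y, np.arange(8), perm[:8])
```

Starting from eight seed pairs, the loop recovers the full translation dictionary on this idealised data. Real embedding spaces are only approximately isomorphic, so the induced dictionary is noisy there.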
When constructing mappings between embedding spaces, a bilingual dictionary
can help, as its entries are used as anchors for the alignment map in supervised
and semi-supervised approaches. However, lately, researchers have
proposed methods that do not require a bilingual dictionary but rely on the
adversarial approach (Conneau et al., 2018) or use the words’ frequencies (Artetxe
et al., 2018b) to find a required transformation. These are called unsupervised
approaches.
2 https://github.com/artetxem/vecmap
2.2 Projecting into a joint vector space
To construct a common vector space for all the processed languages, one requires
a large aligned bilingual or multilingual parallel corpus. The constructed
embeddings must map the same words in different languages as close as
possible in the common vector space. The availability and quality of alignments
in the training corpus may present an obstacle. While Wikipedia,
subtitles, and translation memories are good sources of aligned texts for large
languages, less-resourced languages are not well represented, and building embeddings
for such languages is a challenge.
LASER (Language-Agnostic SEntence Representations) is a Facebook research
project focusing on joint sentence representation for many languages
(Artetxe and Schwenk, 2019). Strictly speaking, LASER is not a word embedding
method but a sentence embedding method. Similarly to machine translation
architectures, LASER uses an encoder-decoder architecture. The encoder is trained on a large
parallel corpus, translating a sentence in any language or script to a parallel
sentence in either English or Spanish (whichever exists in the parallel corpus),
thereby forming a joint representation of entire sentences in many languages
in a shared vector space. The project focused on scaling to many languages;
currently, the encoder supports 93 different languages. Using LASER, one
can train a classifier on data from just one language and use it on any language
supported by LASER. A vector representation in the joint embedding
space can be transformed back into a sentence using a decoder for the specific
language.
2.3 Multilingual BERT and CroSloEngual BERT
BERT (Bidirectional Encoder Representations from Transformers) embedding
(Devlin et al., 2019) generalises the idea of a language model (LM) to
masked LMs, inspired by the cloze test, which checks understanding of a
text by removing a few words, which the participant is asked to replace.
The masked LM randomly masks some of the tokens from the input, and
the task is to predict the missing token based on its neighbourhood. BERT
uses transformer neural networks (Vaswani et al., 2017) in a bidirectional
sense and further introduces the task of predicting whether two sentences
appear in a sequence. The input representation of BERT consists of sequences of
tokens representing sub-word units. The input is constructed by summing
the embeddings of corresponding tokens, segments, and positions. Some
widespread words are kept as single tokens; others are split into sub-words
(e.g., frequent stems, prefixes, suffixes, if needed down to single-letter
tokens). The original BERT project offers pre-trained English, Chinese, and
multilingual models. The latter, called mBERT, is trained on 104 languages
simultaneously.
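The summation of token, segment, and position embeddings can be illustrated with a small NumPy sketch. The table sizes, the random values, and the function name `bert_input` are hypothetical and far smaller than in any real BERT model; only the element-wise sum of the three lookups reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_segments, max_len, dim = 100, 2, 16, 8   # toy sizes, not BERT's real ones

# Three lookup tables; in a real model they are learned, here they are random.
token_emb = rng.normal(size=(vocab_size, dim))
segment_emb = rng.normal(size=(n_segments, dim))
position_emb = rng.normal(size=(max_len, dim))

def bert_input(token_ids, segment_ids):
    """Sum the token, segment, and position embeddings for every input position."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Four sub-word tokens: the first two belong to sentence A, the last two to sentence B.
x = bert_input([5, 17, 42, 9], [0, 0, 1, 1])
```

Each row of `x` is the vector passed to the first transformer layer for one sub-word token.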
Using BERT in classification tasks only requires adding connections between
its last hidden layer and new neurons corresponding to the number of classes
in the intended task. The fine-tuning process is applied to the whole network,
and all the parameters of BERT and new class-specific weights are fine-tuned
jointly to maximise the log-probability of correct labels.
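The fine-tuning objective can be written compactly. The notation is ours and not taken from the source: $h_\theta(x)$ denotes the last-hidden-layer representation that BERT with parameters $\theta$ produces for input $x$, $w_c$ and $b_c$ are the new class-specific weights for class $c$, and a softmax output is assumed, as is standard for such classification heads.

$$
p(c \mid x) = \frac{\exp\left(w_c^\top h_\theta(x) + b_c\right)}{\sum_{c'} \exp\left(w_{c'}^\top h_\theta(x) + b_{c'}\right)},
\qquad
\max_{\theta, W, b} \; \sum_{i} \log p(y_i \mid x_i)
$$

Here $(x_i, y_i)$ are the labelled training instances, and the maximisation runs over both the pretrained parameters $\theta$ and the new weights $W = (w_c)_c$, $b = (b_c)_c$.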
Recently, a new type of multilingual BERT model has emerged that reduces the
number of languages in multilingual models. For example, CSE BERT (Ulčar
and Robnik-Šikonja, 2020) uses Croatian, Slovene (two similar less-resourced 
languages from the same language family), and English. The main reasons for 
this choice are to represent each language better and keep sensible sub-word 
vocabulary, as shown by Virtanen et al. (2019). This model is built with the 
cross-lingual transfer of prediction models in mind. As CSE BERT includes 
English, we expect that it will enable a better transfer of existing prediction 
models from English to Croatian and Slovene. 
2.4 Twitter sentiment classification
We present a brief overview of the related work on automated sentiment classification
of Twitter posts. We summarise the published labelled sets used for
training the classification models and the machine learning methods applied 
for training. Most of the related work is limited to only English texts.
To train a sentiment classifier, one needs a reasonably large training dataset
of tweets already labelled with the sentiment. One can rely on a proxy, e.g.,
emoticons used in the tweets, to determine the intended sentiment; however,
high-quality labelling requires the engagement of human annotators.
There exist several publicly available and manually labelled Twitter datasets.
They vary in the number of examples from several hundred to several
thousand, but to the best of our knowledge, so far, none exceeds 20,000
entries. Saif et al. (2013) describe eight Twitter sentiment datasets and introduce
a new one that contains separate sentiment labels for tweets and entities.
Rosenthal et al. (2015) provide statistics for several of the 2013–2015
SemEval datasets.
There are several supervised machine learning algorithms suitable to train
sentiment classifiers from sentiment labelled tweets. For example, in the
SemEval-2015 competition, before the rise of deep neural networks, the most
often used algorithms for the sentiment analysis on Twitter (Rosenthal et al.,
2015) were support vector machines (SVM), maximum entropy, conditional
random fields, and linear regression. In other cases, frequently used classifiers
were naive Bayes, k-nearest neighbours, and even decision trees. Often,
SVM was shown as the best performing classifier for the Twitter sentiment.
However, only recently, when researchers started to apply deep learning for
the Twitter sentiment classification, considerable improvements in classification
performance were observed (Wehrmann et al., 2017; Jianqiang et al.,
2018; Naseem et al., 2020). Similarly to our approach, recent approaches use
contextual embeddings such as ELMo (Peters et al., 2018) and BERT (Devlin
et al., 2019), but in a monolingual setting.
2.5 Transfer of trained models
Cross-lingual word embeddings can be used directly as inputs in natural
language processing models. The main idea is to train a model on data from
one language and then apply it to another, relying on shared cross-lingual
representation. Several tasks have been attempted in testing cross-lingual
transfer. Søgaard et al. (2019) survey the transfer in the following tasks: document
classification, dependency parsing, POS tagging, named entity recognition,
super-sense tagging, semantic parsing, discourse parsing, dialogue
state tracking, entity linking (wikification), sentiment analysis, machine
translation, natural language inference, etc. For example, Ranasinghe
and Zampieri (2020) apply large pretrained models in a similar way as we do
but use the offensive language domain and only four languages from different
families (English, Spanish, Bengali, and Hindi). In sentiment analysis,
which is of particular interest in this work, Mogadala and Rettinger (2016)
evaluate their embeddings on the multilingual Amazon product review dataset.
In the Twitter sentiment analysis, Wehrmann et al. (2017) use LSTM
networks but first learn a joint representation for four languages (English,
German, Portuguese, and Spanish) with character-based convolutional
neural networks.
3 DATASETS AND EXPERIMENTAL SETTINGS
This section presents the evaluation metrics, experimental data, and implementation
details of the used neural prediction models.
3.1 Evaluation metrics
Following Mozetič et al. (2016), we report the $\bar{F}_1$ score and classification
accuracy ($CA$). The $F_1(c)$ score for class value $c$ is the harmonic mean of precision $p$
and recall $r$ for the given class $c$, where the precision is defined as the proportion
of correctly classified instances from the instances predicted to be from
the class $c$, and the recall is the proportion of correctly classified instances
actually from the class $c$:

$$F_1(c) = \frac{2\, p(c)\, r(c)}{p(c) + r(c)}$$
The $F_1$ score returns values from the $[0, 1]$ interval, where 1 means perfect classification,
and 0 indicates that either precision or recall for class $c$ is 0. We
use an instance of the $F_1$ score specifically designed to evaluate the 3-class
sentiment models (Kiritchenko et al., 2014). $\bar{F}_1$ is defined as the average over
the positive ($+$) and negative ($-$) sentiment class:

$$\bar{F}_1 = \frac{F_1(+) + F_1(-)}{2}$$
$\bar{F}_1$ implicitly considers the ordering of sentiment values by considering only
the extreme labels, positive ($+$) and negative ($-$). The middle, neutral, is taken
into account indirectly. $\bar{F}_1 = 1$ implies that all negative and positive tweets were
correctly classified, and as a consequence, all neutrals as well. $\bar{F}_1 = 0$ indicates
that all tweets were classified as neutral, and consequently, all negative and
positive tweets were incorrectly classified.
$\bar{F}_1$ is not the best performance measure. First, taking the arithmetic average
of the $F_1$ scores over different classes (called macro $F_1$) is methodologically
misguided (Flach and Kull, 2015). It is justified only when the class distribution
is approximately even, as in our case. Second, $\bar{F}_1$ does not account for
correct classifications by chance. A more appropriate measure that allows for
class ordering, classification by chance, and class labelling with disagreements
is Krippendorff’s alpha-reliability (Krippendorff, 2013). However, since $\bar{F}_1$ is
commonly used in the sentiment classification community, and the results are
typically well-correlated with the alpha-reliability, we decided to report our
experimental results in terms of $\bar{F}_1$.
The second score we report is the classification accuracy $CA$, defined as the
ratio of correctly predicted tweets $N_c$ to all the tweets $N$:

$$CA = \frac{N_c}{N}$$
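Both scores follow directly from their definitions and can be computed as in the plain-Python sketch below. The function names and the label encoding ("+", "0", "-") are our own choices for illustration. Returning 0 for a class with no correctly classified instances matches the convention that $F_1(c)$ is 0 whenever precision or recall is 0.

```python
def f1_for_class(y_true, y_pred, c):
    """Harmonic mean of precision and recall for class c (0.0 if there are no hits)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    if tp == 0:
        return 0.0
    predicted = sum(1 for p in y_pred if p == c)   # instances predicted as c
    actual = sum(1 for t in y_true if t == c)      # instances actually from c
    precision = tp / predicted
    recall = tp / actual
    return 2 * precision * recall / (precision + recall)

def f1_bar(y_true, y_pred, pos="+", neg="-"):
    """Average of the F1 scores of the positive and the negative class."""
    return (f1_for_class(y_true, y_pred, pos) + f1_for_class(y_true, y_pred, neg)) / 2

def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example (hypothetical labels): "0" stands for the neutral class.
y_true = ["+", "+", "0", "-", "-", "0"]
y_pred = ["+", "0", "0", "-", "+", "0"]
score = f1_bar(y_true, y_pred)
ca = accuracy(y_true, y_pred)
```

In the toy example, the F1 score is 1/2 for the positive class and 2/3 for the negative class, so the averaged score is 7/12, while four of the six predictions are correct.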
3.2 Datasets
We use a corpus of Twitter sentiment datasets (Mozetič et al., 2016), consisting
of 15 languages, with over 1.6 million annotated tweets. The languages
covered are Albanian, Bosnian, Bulgarian, Croatian, English, German,
Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovene, Spanish,
and Swedish. The authors studied the annotators’ agreement on the labelled
tweets. They discovered that the SVM classifier achieves a significantly lower
score for some languages (English, Russian, Slovak) than the annotators. This
hints that there might be room for improvement for these languages using a
better classification model or a larger training set.
We cleaned the above datasets by removing the duplicated tweets, weblinks,
and hashtags. Due to the low quality of sentiment annotations indicated by
low self-agreement and low inter-annotator agreement, we removed the Albanian
and Spanish datasets. For these two languages, the self-agreement expressed
with the $\bar{F}_1$ score is 0.60 and 0.49, respectively; the inter-annotator agreement is
0.41 and 0.42. As defined above, $\bar{F}_1$ is the arithmetic average of $F_1$ scores for
the positive and negative tweets, where $F_1(c)$ is the fraction of equally labelled
tweets out of all the tweets with the label $c$.
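The cleaning step can be sketched as below. The source does not give the exact rules, so everything here is an assumption for illustration: the regular expressions, the choice to drop whole hashtags rather than only the "#" sign, the comparison of tweets after cleaning, and the function names `clean_tweet` and `clean_dataset`.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed pattern for weblinks
HASHTAG_RE = re.compile(r"#\w+")                # assumed pattern for hashtags

def clean_tweet(text):
    """Remove weblinks and hashtags, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())

def clean_dataset(tweets):
    """Clean (text, label) pairs and keep the first occurrence of each cleaned text."""
    seen, cleaned = set(), []
    for text, label in tweets:
        text = clean_tweet(text)
        if text and text not in seen:
            seen.add(text)
            cleaned.append((text, label))
    return cleaned

# Toy input (hypothetical tweets): the first two become identical after cleaning.
sample = [("Hi #a", "+"), ("Hi http://x.y", "+"), ("Bye", "-")]
result = clean_dataset(sample)
```

Comparing tweets after cleaning means that tweets differing only in their links or hashtags count as duplicates; whether the original cleaning did the same is not stated.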
In the paper where the datasets were introduced (Mozetič et al., 2016), Serbian,
Croatian, and Bosnian tweets were merged into a single dataset. The
three languages are very similar and difficult to distinguish in short Twitter
posts. However, it turned out that this merge resulted in a poor classification
performance due to a very different quality of annotations. In particular,
Serbian (71,721 tweets) was annotated by 11 annotators, where two of them
accounted for over 40% of the annotations. All the inter-annotator agreement
measures come from the Serbian only (1,880 tweets annotated twice
by different annotators, $\bar{F}_1$ is 0.51), and there are very few tweets annotated
twice by the same annotator (182 tweets only, $\bar{F}_1$ for the self-agreement is
0.46). In contrast, all the Croatian and Bosnian tweets were annotated by a
single annotator, and we have reliable self-agreement estimates. There are
84,001 Croatian tweets, 13,290 annotated twice, and the self-agreement $\bar{F}_1$
is 0.83. There are 38,105 Bosnian tweets, 6,519 annotated twice, and the
self-agreement $\bar{F}_1$ is 0.78. The authors concluded that the annotation quality
of the Croatian and Bosnian tweets is considerably higher than that of the
Serbian. If one constructs separate sentiment classifiers for each language,
one observes a very different performance than reported originally. The individual
classifiers are better and “well-behaved” compared to the joint Serbian/Croatian/Bosnian
model. In this paper, we follow the authors’ suggestion
that datasets with no overlapping annotations and different annotation
quality are better not merged. As a consequence, the Serbian, Croatian, and
Bosnian datasets are analysed separately. The characteristics of all the 13
datasets are presented in Table 1.
Table 1: The characteristics of datasets

| Language | Negative tweets | Neutral tweets | Positive tweets | All tweets | Self-agreement ($\bar{F}_1$) | Inter-annotator agreement ($\bar{F}_1$) |
|---|---|---|---|---|---|---|
| Bosnian | 12,868 | 11,526 | 13,711 | 38,105 | 0.78 | - |
| Bulgarian | 15,140 | 31,214 | 20,815 | 67,169 | 0.77 | 0.50 |
| Croatian | 21,068 | 19,039 | 43,894 | 84,001 | 0.83 | - |
| English | 26,674 | 46,972 | 29,388 | 103,034 | 0.79 | 0.67 |
| German | 20,617 | 60,061 | 28,452 | 109,130 | 0.73 | 0.42 |
| Hungarian | 10,770 | 22,359 | 35,376 | 68,505 | 0.76 | - |
| Polish | 67,083 | 60,486 | 96,005 | 223,574 | 0.84 | 0.67 |
| Portuguese | 58,592 | 53,820 | 44,981 | 157,393 | 0.74 | - |
| Russian | 34,252 | 44,044 | 29,477 | 107,773 | 0.82 | - |
| Serbian | 24,860 | 30,700 | 16,161 | 71,721 | 0.46 | 0.51 |
| Slovak | 18,716 | 14,917 | 36,792 | 70,425 | 0.77 | - |
| Slovene | 38,975 | 60,679 | 34,281 | 133,935 | 0.73 | 0.54 |
| Swedish | 25,319 | 17,857 | 15,371 | 58,547 | 0.76 | - |
Note. The left-hand side reports the number of tweets from each category and the overall number
of instances for individual languages. The right-hand side contains self-agreement of annotators
and inter-annotator agreement for those languages where more than one annotator was involved.
3.3 Implementation details
In our experiments, we use three different types of prediction models: BiLSTM
neural networks using joint vector space embeddings constructed with
the LASER library, and two variants of BERT, mBERT and CSE BERT. The
original mBERT (bert-multi-cased) is pretrained on 104 languages, has 12
transformer layers, and 110 million parameters. The CSE BERT uses the same
architecture but is pretrained only on Croatian, Slovene, and English. In the
construction of sentiment classification models, we fine-tune the whole network,
using a batch size of 32, 2 epochs, and the Adam optimiser. We also tested
larger numbers of epochs and larger batch sizes in preliminary experiments,
but this did not improve the performance.
The cross-lingual embeddings from the LASER library are pretrained on 93
languages, using BiLSTM networks, and are stored as 1024-dimensional embedding
vectors. Our classification models contain an embedding layer, followed
by a multilayer perceptron hidden layer of size 8, and an output layer
with three neurons (corresponding to three output classes, negative, neutral,
and positive sentiment) using the softmax. We use the ReLU activation
function and Adam optimiser. The fine-tuning uses a batch size of 32 and 10
epochs.
Further technical details are available in the freely available source code.
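To make the size of the LASER-based classifier concrete, the following NumPy sketch runs a forward pass through a network with the dimensions given above (1024 inputs, 8 hidden ReLU units, 3 softmax outputs). It is not the authors' code: the embedding layer is replaced by random stand-in vectors, the initialisation is arbitrary, training with Adam is omitted, and the names `init_params` and `forward` are hypothetical.

```python
import numpy as np

EMB_DIM, HIDDEN, N_CLASSES = 1024, 8, 3   # sizes stated in the text

def init_params(rng):
    """Small random weights and zero biases for the two dense layers."""
    return {
        "W1": rng.normal(scale=0.01, size=(EMB_DIM, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(scale=0.01, size=(HIDDEN, N_CLASSES)),
        "b2": np.zeros(N_CLASSES),
    }

def forward(params, X):
    """Hidden ReLU layer followed by a softmax over the three sentiment classes."""
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.normal(size=(32, EMB_DIM))   # stand-in for one batch of 32 sentence embeddings
probs = forward(params, batch)           # one row of class probabilities per tweet
n_params = sum(p.size for p in params.values())
```

With these dimensions the classifier on top of the fixed sentence embeddings has 8,227 trainable parameters, which is tiny compared with the 110 million parameters fine-tuned in the BERT models.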
4 EXPERIMENTS AND RESULTS
Our experimental work focuses on model transfer with cross-lingual embeddings.
However, to first establish the suitability of different embedding spaces
for Twitter sentiment classification, we start with their comparison in a
monolingual setting in Section 4.1. We compare the three neural approaches
presented in Section 3.3 (common vector space of LASER, mBERT, and CSE
BERT). As a baseline, we use the classical approach using bag-of-ngram representation
with the SVM classifier. In the cross-lingual experiments, we focus
on the two most-successful types of model transfer, described in Sections
2.2 and 2.3: the common vector space of the LASER library and the variants
of the multilingual BERT model (mBERT and CSE BERT). We conducted several
cross-lingual transfer experiments: transfer of models between languages
from the same (Section 4.2) and different language family (Section 4.3), as
well as the expansion of training sets with varying amounts of data from other
languages (Section 4.4). In the experiments, we did not systematically test all
possible combinations of languages and language groups as this would require
an excessive amount of computational time and reporting space, and would
not contribute to the clarity of the paper. Instead, we arbitrarily selected a
representative set of language combinations in advance. We leave a comprehensive
systematic approach based on informative features (Lin et al., 2019)
for further work.
4.1 Comparing embedding spaces
To establish the appropriateness of different embedding approaches for our 
Twitter sentiment classification task, we start with experiments in a mono -
lingual setting. We compare embeddings into a joint vector space obtained 
with the LASER library with mBERT and CSE BERT. Note that there is no 
transfer between different languages in this experiment but only a test of 
Slovenscina_2_2021_1 korekture3.indd   13 Slovenscina_2_2021_1 korekture3.indd   13 30. 06. 2021   07:56:29 30. 06. 2021   07:56:29
14 15
Slovenščina 2.0, 2021 (1)
the suitability of the representation, i.e. embeddings. To make the results
comparable with previous work on these datasets, we report results obtained
with 10-fold blocked cross-validation. There is no randomisation of training
examples in blocked cross-validation, and each fold is a block of consecutive
tweets. Standard cross-validation with a random selection of examples yields
unrealistic estimates of classifier performance and should not be used to
evaluate classifiers in time-ordered data scenarios (Mozetič et al., 2018).
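The blocked scheme can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper name `blocked_folds` is hypothetical, and the sketch assumes that the tweets are already sorted by time and that the folds are of nearly equal size.

```python
def blocked_folds(n_examples, n_folds=10):
    """Yield (train_indices, test_indices) for blocked cross-validation.

    Examples are assumed to be in time order. Each test fold is one
    contiguous block of indices and nothing is shuffled.
    """
    # Integer fold boundaries that split range(n_examples) into n_folds blocks.
    bounds = [i * n_examples // n_folds for i in range(n_folds + 1)]
    for k in range(n_folds):
        start, stop = bounds[k], bounds[k + 1]
        test = list(range(start, stop))
        # Everything outside the current block is used for training.
        train = list(range(0, start)) + list(range(stop, n_examples))
        yield train, test


folds = list(blocked_folds(25, n_folds=10))
first_train, first_test = folds[0]
```

With scikit-learn, `KFold(n_splits=10, shuffle=False)` produces the same kind of contiguous, unshuffled test blocks.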
As a baseline, we report the results of SVM models without neural embeddings
that use a Delta TF-IDF weighted bag-of-ngrams representation with
substantial preprocessing of tweets (Mozetič et al., 2016). As the datasets for
the Bosnian, Croatian, and Serbian languages were merged by Mozetič et al.
(2016) due to the similarity of these languages, we report the performance on
the merged dataset for the SVM classifier. Results are presented in Table 2.
Table 2: Comparison of different representations: supervised mapping into a joint vector space 
with the LASER library, mBERT, CSE BERT, and bag-of-ngrams with the SVM classifier
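| Language | LASER $\overline{F}_1$ | LASER CA | mBERT $\overline{F}_1$ | mBERT CA | CSE BERT $\overline{F}_1$ | CSE BERT CA | SVM $\overline{F}_1$ | SVM CA |
|---|---|---|---|---|---|---|---|---|
| Bosnian | **0.68** | 0.64 | 0.65 | 0.60 | **0.68** | **0.65** | (0.61) | (0.56) |
| Bulgarian | 0.53 | **0.59** | **0.58** | **0.59** | 0.00 | 0.45 | 0.52 | 0.54 |
| Croatian | 0.72 | 0.68 | 0.64 | 0.66 | **0.76** | **0.71** | (0.61) | (0.56) |
| English | 0.62 | 0.65 | **0.68** | **0.68** | 0.67 | 0.66 | 0.63 | 0.64 |
| German | 0.52 | 0.64 | **0.66** | **0.66** | 0.31 | 0.59 | 0.54 | 0.61 |
| Hungarian | 0.63 | 0.67 | **0.65** | **0.69** | 0.57 | 0.65 | 0.64 | 0.67 |
| Polish | **0.70** | 0.66 | **0.70** | **0.70** | 0.56 | 0.57 | 0.68 | 0.63 |
| Portuguese | 0.48 | 0.47 | 0.50 | 0.49 | 0.12 | 0.22 | **0.55** | **0.51** |
| Russian | **0.70** | **0.70** | 0.64 | 0.64 | 0.07 | 0.43 | 0.61 | 0.60 |
| Serbian | 0.50 | 0.54 | 0.50 | 0.52 | 0.30 | 0.50 | (**0.61**) | (**0.56**) |
| Slovak | **0.72** | **0.72** | 0.67 | 0.66 | 0.69 | 0.71 | 0.68 | 0.68 |
| Slovene | 0.57 | 0.58 | 0.58 | 0.58 | **0.60** | **0.61** | 0.55 | 0.54 |
| Swedish | **0.67** | 0.64 | **0.67** | **0.65** | 0.54 | 0.56 | 0.66 | 0.62 |
| #Best | 5 | 3 | 6 | 6 | 3 | 3 | 2 | 2 |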
Note. The best score for each language and metric is in bold. In the last row, we count the number
of best scores for each model. The SVM results for Bosnian, Croatian, and Serbian (in parentheses)
were obtained with the model trained on the merged dataset of these languages and are therefore
not directly comparable with the language-specific results for the other representations.
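For readers who want to recompute the two reported metrics, the following sketch may help. It rests on an assumption that is not stated in this section: that $\overline{F}_1$ is the mean of the $F_1$ scores of the negative and positive classes, as in Mozetič et al. (2016), while CA is the share of correctly classified tweets over all three classes. The function names and label strings are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(y_true, y_pred, labels=("negative", "positive")):
    """ASSUMED definition: average F1 over the negative and positive class."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


def classification_accuracy(y_true, y_pred):
    """Share of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["negative", "negative", "neutral", "positive", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive", "positive", "negative"]
f1_bar = mean_f1(y_true, y_pred)             # (0.5 + 0.8) / 2
ca = classification_accuracy(y_true, y_pred)  # 4 of 6 correct
```

Under this assumed definition, a classifier that predicts only the neutral class scores 0 on $\overline{F}_1$ while its CA stays above zero, so the two metrics can diverge considerably.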
The SVM baseline using the bag-of-ngrams representation mostly achieves lower
predictive performance than the neural embedding approaches. We speculate
that the main reason is that the precomputed dense embeddings used by the
neural approaches contain more information about the language structure.
Since standard feature-based machine learning approaches also require much
more preprocessing effort, we see no good reason to use this approach in text
classification and omit it from further experiments. The mBERT model is the
best of the tested methods, achieving the best $\overline{F}_1$ and CA scores
in six languages (in bold), closely followed by the LASER approach, which
achieves the best $\overline{F}_1$ score in five languages and the best CA
score in three languages. CSE BERT is specialised for only three languages,
and it achieves the best scores in the languages on which it was trained
(except in English, where it is close behind mBERT) and in Bosnian, which is
similar to Croatian. Overall, large pretrained transformer models (mBERT and
CSE BERT) seem to dominate in Twitter sentiment prediction. The downside of
these models is that their training, fine-tuning, and execution require more
computational time than precomputed fixed embeddings. Nevertheless, with
progress in optimisation techniques for neural network learning and the advent
of computationally more efficient BERT variants, e.g. (You et al., 2020), this
obstacle might disappear in the future.
4.2 Transfer to the same language family
The transfer of prediction models between similar languages from the same
language family is the most likely to be successful. We test several
combinations of source and target languages from the Slavic and Germanic
language families. We report the results in Table 3.
In each experiment, we use the entire dataset(s) of the source language(s) as
the training set and the whole dataset of the target language as the testing
set, i.e. we do a zero-shot transfer. As a reference, we use the LASER
embeddings with a BiLSTM network trained and tested on the target language
only, where 70% of the dataset is used for training and 30% for testing. As
we use large datasets, the reference results can be taken as an upper bound of
what cross-lingual transfer models could achieve in ideal conditions.
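The two data regimes compared in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names and the dictionary layout (`datasets` maps a language name to a list of labelled tweets) are hypothetical, and the text does not say how the 70/30 split is drawn, so the sketch simply takes the first 70% of the target data for training.

```python
def zero_shot_split(datasets, sources, target):
    """Zero-shot transfer: train on all source data, test on all target data."""
    train = [example for lang in sources for example in datasets[lang]]
    test = list(datasets[target])
    return train, test


def target_only_split(datasets, target, train_percent=70):
    """Reference setting: train and test on the target language only.

    ASSUMPTION: the first train_percent of the examples form the training
    set; the section does not specify how the split is drawn.
    """
    data = datasets[target]
    cut = len(data) * train_percent // 100
    return list(data[:cut]), list(data[cut:])


# Toy stand-ins for labelled tweets: (text, label) pairs.
datasets = {
    "German": [(f"de-{i}", "neutral") for i in range(5)],
    "Swedish": [(f"sv-{i}", "positive") for i in range(3)],
    "English": [(f"en-{i}", "negative") for i in range(10)],
}

transfer_train, transfer_test = zero_shot_split(datasets, ["German", "Swedish"], "English")
ref_train, ref_test = target_only_split(datasets, "English")
```

In the zero-shot case no English example reaches the training set, whereas the reference model sees 70% of the English data.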
The results from Table 3 (bottom line) show that there is a gap in the
performance of transfer learning models and native models. On average, the gap
in $\overline{F}_1$ is 5% for the LASER approach, 6% for mBERT, and 8% for CSE
BERT. For CA, the average gap is 7% for both LASER and mBERT and 8% for CSE
BERT. However, there are significant differences between languages, and we
advise testing both LASER and mBERT for a specific new language, as the models
are highly competitive. CSE BERT is slightly less successful measured with the
average performance gap over all languages, as the gap is 8% in both
$\overline{F}_1$ and CA. However, if we take only the three languages used in
the training of CSE BERT (Croatian, Slovene, and English), as shown in
Table 4, the conclusions are entirely different. The average performance gap is
0% in $\overline{F}_1$ and 1% in the classification accuracy, meaning that we
get an almost perfect cross-lingual transfer for these languages on the Twitter
sentiment prediction task.
Table 3: The transfer of trained models between languages from the same language family
using LASER common vector space, mBERT, and CSE BERT
| Source | Target | LASER $\overline{F}_1$ | LASER CA | mBERT $\overline{F}_1$ | mBERT CA | CSE BERT $\overline{F}_1$ | CSE BERT CA | Both target $\overline{F}_1$ | Both target CA |
|---|---|---|---|---|---|---|---|---|---|
| German | English | 0.55 | 0.59 | 0.63 | 0.64 | 0.42 | 0.42 | 0.62 | 0.65 |
| English | German | 0.55 | 0.60 | 0.66 | 0.70 | 0.50 | 0.58 | 0.53 | 0.65 |
| Polish | Russian | 0.64 | 0.59 | 0.57 | 0.57 | 0.50 | 0.40 | 0.70 | 0.70 |
| Polish | Slovak | 0.63 | 0.59 | 0.58 | 0.59 | 0.63 | 0.65 | 0.72 | 0.72 |
| German | Swedish | 0.58 | 0.57 | 0.59 | 0.59 | 0.58 | 0.56 | 0.67 | 0.65 |
| German, Swedish | English | 0.58 | 0.60 | 0.55 | 0.56 | 0.41 | 0.42 | 0.62 | 0.65 |
| Slovene, Serbian | Russian | 0.53 | 0.55 | 0.57 | 0.57 | 0.58 | 0.48 | 0.70 | 0.70 |
| Slovene, Serbian | Slovak | 0.59 | 0.52 | 0.57 | 0.59 | 0.48 | 0.60 | 0.72 | 0.72 |
| Serbian | Slovene | 0.54 | 0.57 | 0.54 | 0.54 | 0.56 | 0.55 | 0.60 | 0.60 |
| Serbian | Croatian | 0.67 | 0.64 | 0.65 | 0.62 | 0.65 | 0.70 | 0.73 | 0.68 |
| Serbian | Bosnian | 0.65 | 0.61 | 0.61 | 0.60 | 0.59 | 0.62 | 0.67 | 0.64 |
| Polish | Slovene | 0.51 | 0.48 | 0.55 | 0.54 | 0.50 | 0.53 | 0.60 | 0.60 |
| Slovak | Slovene | 0.52 | 0.51 | 0.54 | 0.54 | 0.58 | 0.58 | 0.60 | 0.60 |
| Croatian | Slovene | 0.53 | 0.53 | 0.53 | 0.54 | 0.61 | 0.60 | 0.60 | 0.60 |
| Croatian | Serbian | 0.54 | 0.52 | 0.52 | 0.51 | 0.52 | 0.49 | 0.48 | 0.54 |
| Croatian | Bosnian | 0.66 | 0.61 | 0.57 | 0.56 | 0.67 | 0.62 | 0.67 | 0.64 |
| Slovene | Croatian | 0.70 | 0.65 | 0.64 | 0.63 | 0.73 | 0.69 | 0.73 | 0.68 |
| Slovene | Serbian | 0.52 | 0.55 | 0.46 | 0.49 | 0.47 | 0.50 | 0.48 | 0.54 |
| Slovene | Bosnian | 0.66 | 0.61 | 0.58 | 0.56 | 0.66 | 0.62 | 0.67 | 0.64 |
| Average performance gap | | 0.05 | 0.07 | 0.06 | 0.07 | 0.08 | 0.08 | | |
Note. We compare the results with both training and testing set from the target language using
the LASER approach (the right-most two columns).
We also tried more than one source language at once, for example, German and
Swedish as source languages and English as the target language, as shown in
Table 3. The success of the tested combinations is mixed: for some models and
some languages, we slightly improve the scores, while for others, we slightly
decrease them. We hypothesise that our datasets for individual languages are
large enough that additional training data does not help.
Table 4: The transfer of sentiment models between all combinations of languages on which CSE 
BERT was trained (Croatian, Slovene, and English)
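| Source | Target | LASER $\overline{F}_1$ | LASER CA | mBERT $\overline{F}_1$ | mBERT CA | CSE BERT $\overline{F}_1$ | CSE BERT CA | Both target $\overline{F}_1$ | Both target CA |
|---|---|---|---|---|---|---|---|---|---|
| Croatian | Slovene | 0.53 | 0.53 | 0.53 | 0.54 | 0.61 | 0.60 | 0.60 | 0.60 |
| Croatian | English | 0.63 | 0.63 | 0.63 | 0.66 | 0.62 | 0.64 | 0.62 | 0.65 |
| English | Slovene | 0.54 | 0.57 | 0.50 | 0.53 | 0.59 | 0.57 | 0.60 | 0.60 |
| English | Croatian | 0.62 | 0.67 | 0.67 | 0.63 | 0.73 | 0.67 | 0.73 | 0.68 |
| Slovene | English | 0.63 | 0.64 | 0.65 | 0.67 | 0.63 | 0.64 | 0.62 | 0.65 |
| Slovene | Croatian | 0.70 | 0.65 | 0.64 | 0.63 | 0.73 | 0.69 | 0.73 | 0.68 |
| Croatian, English | Slovene | 0.54 | 0.54 | 0.53 | 0.54 | 0.60 | 0.58 | 0.60 | 0.60 |
| Croatian, Slovene | English | 0.62 | 0.61 | 0.65 | 0.67 | 0.63 | 0.65 | 0.62 | 0.65 |
| English, Slovene | Croatian | 0.64 | 0.68 | 0.63 | 0.63 | 0.68 | 0.70 | 0.73 | 0.68 |
| Average performance gap | | 0.04 | 0.03 | 0.04 | 0.03 | 0.00 | 0.01 | | |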
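The average performance gap in the last row of Table 4 can be recomputed from the rows above it. The sketch below assumes the gap is the mean difference between the target-only reference score and the transferred score; this definition is inferred from the numbers rather than stated in the text, but it reproduces the reported CSE BERT values. The names `TABLE4_CSE_BERT` and `average_gap` are hypothetical.

```python
# Each row: (transfer F1, transfer CA, reference F1, reference CA), copied
# from the CSE BERT and "Both target" columns of Table 4.
TABLE4_CSE_BERT = [
    (0.61, 0.60, 0.60, 0.60),  # Croatian -> Slovene
    (0.62, 0.64, 0.62, 0.65),  # Croatian -> English
    (0.59, 0.57, 0.60, 0.60),  # English -> Slovene
    (0.73, 0.67, 0.73, 0.68),  # English -> Croatian
    (0.63, 0.64, 0.62, 0.65),  # Slovene -> English
    (0.73, 0.69, 0.73, 0.68),  # Slovene -> Croatian
    (0.60, 0.58, 0.60, 0.60),  # Croatian, English -> Slovene
    (0.63, 0.65, 0.62, 0.65),  # Croatian, Slovene -> English
    (0.68, 0.70, 0.73, 0.68),  # English, Slovene -> Croatian
]


def average_gap(rows, transfer_col, reference_col):
    """Mean of (reference score - transferred score) over all rows."""
    diffs = [row[reference_col] - row[transfer_col] for row in rows]
    return sum(diffs) / len(diffs)


f1_gap = average_gap(TABLE4_CSE_BERT, transfer_col=0, reference_col=2)
ca_gap = average_gap(TABLE4_CSE_BERT, transfer_col=1, reference_col=3)
print(f"{f1_gap:.2f} {ca_gap:.2f}")  # prints "0.00 0.01"
```

Note that individual differences can be negative (the transferred model occasionally beats the reference), so a gap of zero is an average and not a statement about every language pair.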
4.3 Transfer to a different language family
The transfer of prediction models between languages from different language 
families is less likely to be successful. Nevertheless, to observe the difference, 
we test several combinations of source and target languages from different 
language families (one from Slavic, the other from Germanic, and vice versa). 
We compare the LASER approach with mBERT models; the CSE BERT is not 
constructed for this setting, and we skip it in this experiment. We report the 
results in Table 5.
The results show that with the LASER approach, there is an average decrease
in performance for transfer learning models of 11% (both $\overline{F}_1$ and
CA), and for mBERT, the gap is 9%. This gap is significant and makes the
resulting transferred models less useful in the target languages, though there
are considerable differences between the languages.
Table 5: The transfer of trained models between languages from different language families 
using LASER common vector space and mBERT
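| Source | Target | LASER $\overline{F}_1$ | LASER CA | mBERT $\overline{F}_1$ | mBERT CA | Both target $\overline{F}_1$ | Both target CA |
|---|---|---|---|---|---|---|---|
| Russian | English | 0.52 | 0.56 | 0.52 | 0.57 | 0.62 | 0.65 |
| English | Russian | 0.57 | 0.58 | 0.55 | 0.57 | 0.70 | 0.70 |
| English | Slovak | 0.46 | 0.44 | 0.57 | 0.58 | 0.72 | 0.72 |
| Polish, Slovene | English | 0.58 | 0.57 | 0.60 | 0.60 | 0.62 | 0.65 |
| German, Swedish | Russian | 0.61 | 0.61 | 0.62 | 0.59 | 0.70 | 0.70 |
| English, German | Slovak | 0.50 | 0.47 | 0.56 | 0.54 | 0.72 | 0.72 |
| German | Slovene | 0.54 | 0.56 | 0.53 | 0.54 | 0.60 | 0.60 |
| English | Slovene | 0.54 | 0.57 | 0.50 | 0.53 | 0.60 | 0.60 |
| Swedish | Slovene | 0.54 | 0.56 | 0.52 | 0.54 | 0.60 | 0.60 |
| Hungarian | Slovene | 0.52 | 0.52 | 0.53 | 0.54 | 0.60 | 0.60 |
| Portuguese | Slovene | 0.51 | 0.49 | 0.54 | 0.54 | 0.60 | 0.60 |
| Average performance gap | | 0.11 | 0.11 | 0.09 | 0.09 | | |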
Note. We compare the results with both training and testing set from the target language using 
the LASER approach (the right-most two columns). 
4.4 Increasing datasets with several languages
Another type of cross-lingual transfer is possible if we increase the training
sets with instances from several related and unrelated languages. We conduct
two sets of experiments in this scenario. In the first setting, reported in
Table 6, we construct the training set in each experiment from instances of
several languages and 70% of the target language dataset. The remaining 30%
of target language instances are used as the testing set. In the second
setting, reported in Table 7, we merge all other languages and 70% of the
target language into a joint training set. We compare the LASER approach,
mBERT, and also CSE BERT, as Slovene and Croatian are involved in some
combinations.
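Both settings differ only in which additional languages are merged into the training set, which the following sketch makes explicit. It is an illustration under assumptions: the helper `expanded_split` and the `datasets` dictionary are hypothetical, and, since the text does not say how the 70/30 split is drawn, the first 70% of the target data is used for training.

```python
def expanded_split(datasets, extra_languages, target, train_percent=70):
    """Training set = train_percent of the target data + all data of the
    extra languages; test set = the remaining target data."""
    target_data = datasets[target]
    cut = len(target_data) * train_percent // 100
    train = list(target_data[:cut])
    for lang in extra_languages:
        if lang != target:  # the target contributes only its training share
            train.extend(datasets[lang])
    test = list(target_data[cut:])
    return train, test


# Toy stand-ins for labelled tweets: (text, label) pairs.
datasets = {
    "English": [(f"en-{i}", "neutral") for i in range(4)],
    "Croatian": [(f"hr-{i}", "positive") for i in range(6)],
    "German": [(f"de-{i}", "negative") for i in range(5)],
    "Slovene": [(f"sl-{i}", "negative") for i in range(10)],
}

# First setting (as in Table 6): a selected group of languages plus the target.
train_some, test_some = expanded_split(datasets, ["English", "Croatian"], "Slovene")

# Second setting (as in Table 7): all other languages plus the target.
others = [lang for lang in datasets if lang != "Slovene"]
train_all, test_all = expanded_split(datasets, others, "Slovene")
```

In both cases the test set consists of target language instances only, so the scores remain comparable with the target-only reference.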
Table 6 shows a gap between models trained on the expanded datasets and
models trained on only target language data. The decrease is larger for both
BERT models (on average around 10%) than for the LASER approach (on average
3% for $\overline{F}_1$ and 5% for CA). These results indicate that the tested
expansion of datasets was unsuccessful, i.e. the provided amount of training
instances in the target language was already sufficient for successful
learning. The additional instances from other languages in the transformed
space are likely to be of lower quality than the native instances and therefore
decrease the performance.
Table 6: The expansion of training sets with instances from several languages
| Source | Target | LASER $\overline{F}_1$ | LASER CA | mBERT $\overline{F}_1$ | mBERT CA | CSE BERT $\overline{F}_1$ | CSE BERT CA | Target only $\overline{F}_1$ | Target only CA |
|---|---|---|---|---|---|---|---|---|---|
| English, Croatian, Slovene | Slovene | 0.58 | 0.53 | 0.46 | 0.45 | 0.60 | 0.58 | 0.60 | 0.60 |
| English, Croatian, Serbian, Slovak | Slovak | 0.67 | 0.65 | 0.57 | 0.54 | 0.27 | 0.37 | 0.72 | 0.72 |
| Hungarian, Slovak, English, Croatian, Russian | Russian | 0.67 | 0.65 | 0.61 | 0.59 | 0.63 | 0.61 | 0.70 | 0.70 |
| Russian, Swedish, English | English | 0.60 | 0.61 | 0.62 | 0.60 | 0.59 | 0.62 | 0.62 | 0.65 |
| Croatian, Serbian, Bosnian, Slovene | Slovene | 0.54 | 0.58 | 0.44 | 0.45 | 0.57 | 0.56 | 0.60 | 0.60 |
| English, Swedish, German | German | 0.55 | 0.60 | 0.60 | 0.64 | 0.47 | 0.58 | 0.53 | 0.65 |
| Average performance gap | | 0.03 | 0.05 | 0.08 | 0.11 | 0.11 | 0.10 | | |
Note. We compare the LASER approach, mBERT, and CSE BERT. As the upper bound, we give 
results of the LASER approach trained on only the target language. 
The results in Table 7, where we test the expansion of the training set
(consisting of 70% of the dataset in the target language) with all other
languages, show that using many languages and significant enlargement of
datasets is also not successful. The two improvements in the LASER approach
over using only the target language are limited to a single metric
($\overline{F}_1$ in the case of Bulgarian and Serbian), which indicates that
true positives are favoured at the expense of true negatives. For all the
other languages, the tried expansions of training sets are unsuccessful for
the LASER approach; the difference to native models
is on average 3.5% for the $\overline{F}_1$ score and 6% for CA. The mBERT
models are in almost all cases more successful in this massive transfer than
LASER models, and they sometimes marginally beat the reference mBERT approach
trained only on the target language.
Table 7: The expansion of training sets with instances from all other languages (+70% of the 
target language instances) to train the LASER approach and mBERT
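| Target | LASER All & Target $\overline{F}_1$ | LASER All & Target CA | LASER Only Target $\overline{F}_1$ | LASER Only Target CA | mBERT All & Target $\overline{F}_1$ | mBERT All & Target CA | mBERT Only Target $\overline{F}_1$ | mBERT Only Target CA |
|---|---|---|---|---|---|---|---|---|
| Bosnian | 0.64 | 0.59 | 0.67 | 0.64 | 0.63 | 0.60 | 0.65 | 0.60 |
| Bulgarian | **0.54** | 0.56 | 0.50 | 0.59 | **0.60** | **0.60** | 0.58 | 0.59 |
| Croatian | 0.63 | 0.57 | 0.73 | 0.68 | **0.65** | 0.63 | 0.64 | 0.66 |
| English | 0.58 | 0.60 | 0.62 | 0.65 | 0.64 | **0.69** | 0.68 | 0.68 |
| German | 0.52 | 0.59 | 0.53 | 0.65 | 0.61 | 0.66 | 0.66 | 0.66 |
| Hungarian | 0.59 | 0.61 | 0.60 | 0.67 | 0.65 | 0.69 | 0.65 | 0.69 |
| Polish | 0.67 | 0.63 | 0.70 | 0.66 | **0.71** | **0.71** | 0.70 | 0.70 |
| Portuguese | 0.44 | 0.39 | 0.52 | 0.51 | **0.52** | **0.52** | 0.50 | 0.49 |
| Russian | 0.66 | 0.64 | 0.70 | 0.70 | **0.67** | **0.66** | 0.64 | 0.64 |
| Serbian | **0.52** | 0.49 | 0.48 | 0.54 | **0.53** | 0.51 | 0.50 | 0.52 |
| Slovak | 0.64 | 0.61 | 0.72 | 0.72 | 0.67 | 0.65 | 0.67 | 0.66 |
| Slovene | 0.54 | 0.50 | 0.60 | 0.60 | 0.56 | 0.54 | 0.58 | 0.58 |
| Swedish | 0.63 | 0.59 | 0.67 | 0.65 | 0.67 | 0.64 | 0.67 | 0.65 |
| Avg. gap | 0.03 | 0.06 | | | 0.00 | 0.00 | | |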
Note. We compare the results with the training on only the target language. The scores where 
models with the expanded training sets beat their respective reference scores are in bold.
5 CONCLUSIONS
We studied state-of-the-art approaches to the cross-lingual transfer of
Twitter sentiment prediction models: mappings of words into the common vector
space using the LASER library and two multilingual BERT variants (mBERT
and trilingual CSE BERT). Our empirical evaluation is based on relatively
large datasets of labelled tweets from 13 European languages. We first tested
the success of these text representations in a monolingual setting. The
results show that BERT variants are the most successful, closely followed by
the LASER approach, while the classical bag-of-ngrams coupled with the SVM
classifier is no longer competitive with neural approaches. In the cross-lingual
experiments, the results show that there is a significant transfer potential
using the models trained on similar languages; compared to training and testing
on the same language, with LASER, we get on average a 5% lower
$\overline{F}_1$ score and with mBERT a 6% lower $\overline{F}_1$ score. The
transfer of models with CSE BERT is even more successful in the three languages
covered by this model, where we get no performance gap compared to the LASER
approach trained and tested on the target language. Using models trained on
languages from different language families produces larger differences (on
average around 10% for $\overline{F}_1$ and CA). Our attempt to expand training
sets with instances from different languages was unsuccessful using either
additional instances from a small group of languages or instances from all
other languages. The source code of our analyses is freely available (see
footnote 3).
We plan to expand BERT models with additional emotional and subjectivity 
information in future work on sentiment classification. Given the favourable 
results in cross-lingual transfer, we will expand the work to other relevant 
tasks. 
Acknowledgments
The research was supported by the Slovene Research Agency through research
core funding no. P6-0411 and P2-103, as well as project no. J6-2581. This
paper is supported by the European Union's Horizon 2020 Programme project
EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in
European News Media, grant no. 825153), and Rights, Equality and Citizenship
Programme project IMSyPP (Innovative Monitoring Systems and Prevention
Policies of Online Hate Speech, grant no. 875263). The results of this
publication reflect only the authors' view, and the Commission is not
responsible for any use that may be made of the information it contains.
3 https://github.com/kristjanreba/cross-lingual-classification-of-tweet-sentiment
REFERENCES
Artetxe, M., Labaka, G., & Agirre, E. (2018a). Generalising and improving
bilingual word embedding mappings with a multi-step framework of linear
transformations. In Thirty-Second AAAI Conference on Artificial
Intelligence.
Artetxe, M., Labaka, G., & Agirre, E. (2018b). A robust self-learning method
for fully unsupervised crosslingual mappings of word embeddings. In
Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics: Vol. 1 (Long Papers) (pp. 789–798).
Artetxe, M., & Schwenk, H. (2019). Massively multilingual sentence
embeddings for zero-shot crosslingual transfer and beyond. Transactions of the
Association for Computational Linguistics, 7, 597–610.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word
vectors with subword information. Transactions of the Association for
Computational Linguistics, 5, 135–146.
Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2018).
Word translation without parallel data. In Proceedings of the 6th
International Conference on Learning Representations (ICLR). Retrieved from
https://openreview.net/pdf?id=H196sainb
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training
of deep bidirectional transformers for language understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Vol. 1 (Long and Short Papers) (pp. 4171–4186).
Flach, P., & Kull, M. (2015). Precision-recall-gain curves: PR analysis done
right. In Advances in Neural Information Processing Systems (NIPS)
(pp. 838–846).
Jianqiang, Z., Xiaolin, G., & Xuejun, Z. (2018). Deep convolution neural
networks for Twitter sentiment analysis. IEEE Access, 6, 23253–23260.
Kiritchenko, S., Zhu, X., & Mohammad, S. M. (2014). Sentiment analysis of
short informal texts. Journal of Artificial Intelligence Research, 50,
723–762.
Krippendorff, K. (2013). Content Analysis, An Introduction to Its
Methodology (3rd ed.). Thousand Oaks, CA, USA: Sage Publications.
Lin, Y. H., Chen, C. Y., Lee, J., Li, Z., Zhang, Y., Xia, M., Rijhwani, S., et al.
(2019). Choosing transfer languages for cross-lingual learning. In
Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics (ACL) (pp. 3125–3135).
Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among
languages for machine translation. arXiv preprint 1309.4168.
Mogadala, A., & Rettinger, A. (2016). Bilingual word embeddings from
parallel and non-parallel corpora for cross-language text classification. In
Proceedings of NAACL-HLT (pp. 692–702).
Mozetič, I., Grčar, M., & Smailović, J. (2016). Multilingual Twitter sentiment
classification: The role of human annotators. PLOS ONE, 11(5).
doi: 10.1371/journal.pone.0155036
Mozetič, I., Torgo, L., Cerqueira, V., & Smailović, J. (2018). How to evaluate
sentiment classifiers for Twitter time-ordered data? PLOS ONE, 13(3).
Naseem, U., Razzak, I., Musial, K., & Imran, M. (2020). Transformer based
deep intelligent contextual embedding for Twitter sentiment analysis.
Future Generation Computer Systems, 113, 58–69.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., &
Zettlemoyer, L. (2018). Deep contextualised word representations. In
Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Vol. 1 (Long Papers) (pp. 2227–2237).
Ranasinghe, T., & Zampieri, M. (2020). Multilingual Offensive Language
Identification with Cross-lingual Embeddings. In Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (pp. 5838–5844).
Rosenthal, S., Nakov, P., Kiritchenko, S., Mohammad, S. M., Ritter, A., &
Stoyanov, V. (2015). SemEval-2015 task 10: Sentiment Analysis in Twitter.
In Proceedings of 9th International Workshop on Semantic Evaluation
(SemEval) (pp. 451–463).
Saif, H., Fernández, M., He, Y., & Alani, H. (2013). Evaluation datasets for
Twitter sentiment analysis: A survey and a new dataset, the STS-Gold. In 1st
Intl. Workshop on Emotion and Sentiment in Social and Expressive Media:
Approaches and Perspectives from AI (ESSEM).
Søgaard, A., Vulić, I., Ruder, S., & Faruqui, M. (2019). Cross-Lingual Word
Embeddings. Morgan & Claypool Publishers.
Ulčar, M., & Robnik-Šikonja, M. (2020). FinEst BERT and CroSloEngual
BERT. In International Conference on Text, Speech, and Dialogue (TSD)
(pp. 104–111).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances
in Neural Information Processing Systems (NIPS) (pp. 5998–6008).
Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T.,
Ginter, F., & Pyysalo, S. (2019). Multilingual is not enough: BERT for
Finnish. arXiv preprint 1912.07076.
Wehrmann, J., Becker, W., Cagnini, H. E., & Barros, R. C. (2017). A
character-based convolutional neural network for language-agnostic Twitter
sentiment analysis. In 2017 International Joint Conference on Neural
Networks (IJCNN) (pp. 2384–2391).
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., et al.
(2020). Large batch optimization for deep learning: Training BERT in 76
minutes. In 8th International Conference on Learning Representations
(ICLR), 26-30 April, 2020, Addis Ababa, Ethiopia.
CROSS-LINGUAL TRANSFER OF SENTIMENT CLASSIFIERS
Vector embeddings represent words in numeric form so that semantic relations
between words are encoded as distances and directions in the vector space.
Cross-lingual embeddings align the vector spaces of different languages, which
places similar words from different languages close together. Cross-lingual
alignment can operate on pairs of languages or by constructing a joint vector
space for several languages. Cross-lingual vector embeddings can be used to
transfer machine learning models between languages, thereby resolving the
problem of insufficient or non-existent training sets in less-resourced
languages. In this work, we use cross-lingual embeddings to transfer machine
learning prediction models for tweet sentiment between thirteen languages. We
focus on the two ways of transferring models that have recently been the most
successful. The first uses models trained on a joint vector space for many
languages, built with the LASER library. The second uses large BERT-type
language models pretrained on many languages. Our experiments show that the
transfer of models between similar languages is sensible even with no training
data at all in the target language. The performance of the multilingual BERT
and LASER models is comparable, and the differences depend on the language.
Cross-lingual transfer with the CroSloEngual BERT model, pretrained on only
three languages, is considerably better in these and some related languages.
Keywords: natural language processing, machine learning, text embeddings,
sentiment analysis, BERT models
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International.
https://creativecommons.org/licenses/by-sa/4.0/