Slovenščina 2.0, 2021 (1)
SLOVENE AND CROATIAN WORD  
EMBEDDINGS IN TERMS OF GENDER 
OCCUPATIONAL ANALOGIES
Matej ULČAR
Faculty of Computer and Information Science, University of Ljubljana
Anka SUPEJ
Jožef Stefan Institute
Marko ROBNIK-ŠIKONJA
Faculty of Computer and Information Science, University of Ljubljana
Senja POLLAK
Jožef Stefan Institute
Ulčar, M., Supej, A., Robnik-Šikonja, M., Pollak, S. (2021): Slovene and Croatian word  
embeddings in terms of gender occupational analogies. Slovenščina 2.0, 9(1): 26–59. 
DOI: https://doi.org/10.4312/slo2.0.2021.1.26-59
In recent years, the use of deep neural networks and dense vector embeddings
for text representation has led to excellent results in the field of computational
understanding of natural language. It has also been shown that word embeddings
often capture gender, racial and other types of bias. The article focuses
on evaluating Slovene and Croatian word embeddings in terms of gender bias
using word analogy calculations. We compiled a list of masculine and feminine
nouns for occupations in Slovene and evaluated the gender bias of fastText,
word2vec and ELMo embeddings with different configurations and different
approaches to analogy calculations. The lowest occupational gender bias
was observed with the fastText embeddings. Similarly, we compared different
fastText embeddings on Croatian occupational analogies.
Keywords: word embeddings, gender bias, word analogy task, occupations, natural 
language processing
M. ULČAR et al.: Slovene and Croatian word embeddings in terms of gender...
1 INTRODUCTION
Gender biases in language are studied from many different perspectives. 
Sociolinguistic studies report how language use differs between men and 
women (e.g., women tend to have a richer vocabulary, use typical grammatical
structures, and express themselves more moderately) (Lakoff, 1973;
Tannen, 1990; Argamon et al., 2003). Observations that language use varies
between the genders inspired author profiling studies on texts in different
languages and of different genres (Koolen and van Cranenburgh, 2017; Pardo
et al., 2015; Martinc et al., 2017), also in Slovene (Verhoeven et al., 2017;
Škrjanec et al., 2018).¹
The gender dimension is present as a linguistic variation in corpora and in 
the form of multi-layered bias, both in individual texts and in larger corpora. 
Research suggests that:
• The bias is manifested as a lack of mentions of women: corpora often
used in research contain significantly fewer female pronouns (Zhao
et al., 2018) or other references to women (Caldas-Coulthard and
Moon, 2010; Baker, 2010).
• Women are less often authors or editors (Hill and Shaw, 2013): only
16% of Wikipedia editors are female.
• Corpora capture stereotypical collocations (Pearce, 2008), which refer
to women primarily through their reproductive function (Gorjanc,
2007) and do not associate them with (social) power (Baker, 2010).
Recent rapid developments in natural language processing (NLP) are primarily
associated with the use of deep neural networks. Their use requires a
representation of text in the form of numeric vectors, called word embeddings.
The relations between words are expressed in the geometry of the embedding
vector space: embeddings of semantically related words lie close together in the
vector space and are arranged in similar directions. This enables the study of
relations beyond superficial similarities between words, e.g. through analogies
such as the relationship Madrid:Spain being analogous to the relationship
Paris:France (Mikolov et al., 2013b).
1 Note that in these studies non-binary identities are not considered. Male or female
gender is assigned based on, for example, the author's username on social media
platforms or on grammatical markers.
As it turns out, word embeddings often contain bias, be it gender, racial, or
other types. Biases in word embeddings manifest through semantic associations
and consequent proximities in the vector space (Mikolov et al., 2013b). Biases
can be numerically evaluated by, for example, calculating cosine similarity
between embeddings that describe a specific concept (e.g. gender) and potentially
biased concepts. For example, Caliskan et al. (2017) show that word embeddings
associate women with arts and men with science. Utilizing the aforementioned
cosine similarity, a powerful approach to demonstrate potential
bias in word embeddings is through a calculation of occupational analogies
(Bolukbasi et al., 2016). Denoting a vector of word w with v(w), this approach
checks the existence of the following relationships between male and female
word vectors: v(man) - v(male occupation) ≈ v(woman) - v(female occupation).
An example for Slovene is v(moški) - v(učitelj) ≈ v(ženska) - v(učiteljica),
where učitelj and učiteljica correspond to the masculine and feminine
form of the noun for the concept (occupation) teacher, while moški and ženska
denote man and woman (the gender concept), respectively. In case of no
gender bias, the relationship between vectors for man and the masculine form
of occupation and between the vector for woman and the feminine form of the
same occupation would be approximately the same, as illustrated in Figure
1. However, since embeddings are derived from naturally occurring text, it is
not unexpected that they capture human biases and social positions.
The illustration shows a simplified depiction of a few examples with 2-dimensional
vectors. The arrows represent the difference between vectors v(f) and
v(m). The end points of arrows originating in masculine nouns for occupations
represent the expected positions of equivalent feminine nouns if there
were no bias.
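The relationship v(man) - v(male occupation) ≈ v(woman) - v(female occupation) can be sketched with toy numbers. The example below is only an illustration: the 2-dimensional vectors are invented (they do not come from any of the evaluated models), and the helper names `cosine`, `sub`, `add` and `bias_gap` are hypothetical. It adds the gender difference v(f) - v(m) to a masculine occupation vector and measures how far the result is from the feminine form.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Toy 2-dimensional vectors; the values are invented for illustration only.
toy = {
    "moški": [1.0, 1.0],
    "ženska": [1.0, 3.0],
    "režiser": [4.0, 1.5],
    "režiserka": [4.1, 3.4],
    "gozdar": [7.0, 1.0],
    "gozdarka": [6.0, 2.2],
}

# Gender difference vector v(f) - v(m).
gender_diff = sub(toy["ženska"], toy["moški"])

def bias_gap(masc, fem):
    """1 - cosine similarity between the expected and the actual feminine vector."""
    expected = add(toy[masc], gender_diff)
    return 1.0 - cosine(expected, toy[fem])

gap_reziser = bias_gap("režiser", "režiserka")
gap_gozdar = bias_gap("gozdar", "gozdarka")
print(gap_reziser, gap_gozdar)
```

With these made-up values the gap is larger for the gozdar - gozdarka pair than for režiser - režiserka, mirroring the situation depicted in Figure 1; real embeddings have hundreds of dimensions and the gaps would differ.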
In addition to studies that have shown the bias in word embeddings themselves,
other work shows that biases can be transferred onto algorithms for different
NLP tasks, from machine translation (Prates et al., 2020; Vanmassenhove et al.,
2018) to sentiment analysis (Kiritchenko and Mohammad, 2018). On the other hand,
some authors (Nissim et al., 2019) warn that the analogy task's design may
excessively emphasise biases.
Figure 1: A simplified depiction of word vectors. The orange full arrow represents the difference
between vectors for ženska [woman] and moški [man]. The blue dashed arrow represents the
difference between vectors for sestra [sister] and brat [brother]. These two arrows indicate the
expected (non-biased) gender difference vectors. For two male occupations, režiser [film director (M)]
and gozdar [forester (M)], we add the gender difference vectors, and depict the resulting nearest
female occupations (analogies), i.e. gozdarka [forester (F)] and vrtnarka [gardener (F)]; režiserka
[film director (F)] and scenaristka [scriptwriter (F)]. The difference to the expected non-biased point
is larger for the gozdar - gozdarka pair.
Our study makes certain simplifications. First, we do not address non-binary
expressions of gender; for example, we do not specifically address references
such as on/ona, the newly proposed form on_a, introduced to be more inclusive
of non-binary gender identities (Kern and Dobrovoljc, 2017), or noun writings
of the type učitelj/učiteljica (and učitelj_ica). Next, for many professions,
the male form can be used as a general reference for a profession
regardless of gender, and we do not make any distinction between mentions
of occupations that relate to a male representative and those used as a general
mention (note also that the unmarkedness of the masculine form in terms of gender
is no longer universally accepted (Kern and Dobrovoljc, 2017; Popič and
Gorjanc, 2018)). As we analyse and compare the gender bias between different
embedding models, these are not severe limitations, as all the embedding
models are treated equally. Moreover, similar studies on languages where the
gender of a noun is not expressed morphologically can run into more serious
problems (see the warnings by Nissim et al. (2019)).
The main contribution of the paper is the evaluation of Slovene and Croatian
word embedding models in terms of gender, which has not yet been sufficiently
researched (the exceptions being the analysis of the Slovene word2vec model
in Supej et al. (2019) and the Croatian evaluation of embeddings in Svoboda and
Beliga (2018)). The paper extends our work (Supej et al., 2020), where we
focused on quantitative evaluation and comparison of a wide range of Slovene
models and different approaches to evaluation; in this paper, we
additionally compare Croatian word embedding models. The
focus of the paper is to draw the attention of the developers of linguistic and
technological tools (which are based on word embeddings) to the implications
the usage of biased embeddings might have. Although we indirectly problematise
language bias and point out several stereotypical associations, a detailed
critical interpretation falls outside this paper's scope.
The rest of the paper is divided into six sections. We first present related work
(Section 2). Section 3 describes Slovene and Croatian lists of male and female 
occupations and specifies the word embedding models used. In Sections 4 
and 5, methodology and results are addressed, followed by a discussion in 
Section 6, and conclusions with plans for further work in Section 7.
2 RELATED WORK
Language corpora and datasets reflect linguistic variations (including different
types of bias) in relation to social factors. NLP tools are trained on these data and
can inherit the contained variations and biases. The bias in corpora can negatively
impact NLP tools (Sun et al., 2019) and can perpetuate biases held towards certain
groups. Word embeddings are trained on large corpora to capture syntactic
and semantic relations between words, and in doing so they also capture the
expressed biases.
For instance, it has been shown that part-of-speech taggers trained on standard
data sets perform better on older people's language (Hovy and Søgaard, 2015).
Garimella et al. (2019) show that a part-of-speech tagger and a dependency
parser perform successfully on texts written by women, regardless of what
data they had been trained on initially. On the other hand, male authors' texts
are better tagged/parsed when the training data contained enough texts written
by men. The success of tools such as parsers on male authors' texts may be
due to the imbalances in the training data favouring male authorship. It has
also been shown that NLP tools are more effective when demographic variations
are considered (Volkova et al., 2013; Hovy, 2015). Hovy (2015) shows
that including the information on the age and gender of authors improves the
performance on three tasks in five different languages.
Biases can have negative consequences in the coreference resolution task
(Zhao et al., 2018) and can perpetuate biases held towards certain groups
(see examples in Zhao et al., 2017). In the context of texts on mental illness,
Hutchinson et al. (2020) note that topics such as gun violence, homelessness,
and addiction are over-represented, leading to disability topics receiving
particularly negative scores in sentiment analysis tasks. Besides the aspects
above, some authors call attention to the effect biases can have on detection
tools. For example, misogyny detection models may attribute high scores
to non-misogynous texts simply because the latter contain the so-called identity
terms, i.e. terms associated with misogyny (Nozza et al., 2019). In sum,
the interplay of bias and NLP is an important and interesting field receiving
increasing attention, notably regarding word embeddings, as explained next.
In terms of word embeddings, researchers have studied bias by investigating
the proximity of gender-related words to other words in the vector space. For
example, Garg et al. (2018) show that the adjective honourable lies closer to
the word man than to the word woman. Biases are also reflected in analogies:
e.g. Bolukbasi et al. (2016) show that the embedding space solution of the
analogy man:computer programmer ≈ woman:x is x = homemaker. Nissim
et al. (2019) warn that such analogies overemphasise the practical impact of
the biases.
As already mentioned, gender bias in word embeddings is often studied on
analogies of occupations, which is also the case in our study. In morphologically
rich languages, such as Slovene and Croatian, the gender of words is expressed
morphologically. Therefore, the result of the gender analogy is expected to be
the female form of the male variant of the occupation (and vice versa). Svoboda
and Beliga (2018) included masculine and feminine versions of job positions
in Croatian as one of the evaluation aspects of Croatian word2vec and
fastText word embeddings. Preliminary research on word2vec embeddings in
Slovene (Supej et al., 2019) showed that the analogy task's accuracy is reasonably
high both when attempting to find the female and the male equivalent of
an occupation. Results nevertheless reflect gender biases: the first result of the
analogy woman:secretary ≈ man:x is x = boss, while the first ten results of
different analogies indicate other gender inequalities: the association of women
with house chores and men with occupations of a higher status etc. In the
work of Supej et al. (2020) that we extend in this paper, different word2vec,
fastText and ELMo embeddings are compared on Slovene pairs of male and
female occupations.
As tools based on biased word embeddings may reinforce biases (Zhao et al.,
2017), many research groups have focused on debiasing word embeddings: the main
goal of such algorithms is to prevent language models from reproducing racist,
sexist or in other ways harmful content. Debiasing also has other advantages:
it has been shown that debiasing contributes to correct coreference resolution
(Zhao et al., 2018). Some examples of these methods are equalising the distances
between gender-specific words and occupations (Bolukbasi et al., 2016;
Bordia and Bowman, 2019), inserting additional restrictions into the training
corpus (e.g. ensuring equal representation of occupational activities between
the genders in the training data) (Zhao et al., 2017), removing texts that cause
bias (Brunet et al., 2019), and training gender-neutral word embeddings (Zhao
et al., 2018). Schick et al. (2021) recently proposed a self-diagnosis and
self-debiasing model where large language models examine their outputs regarding
the potential presence of undesirable attributes. They introduced a debiasing
algorithm that reduces the likelihood of a model producing biased text. Moreover,
researchers recently also focused on methods for debiasing sentence
representations, addressing the difficulty of the model retraining that is often
proposed in debiasing research (retraining models like BERT and ELMo often
proves infeasible in practice) (Liang et al., 2020). Gonen and Goldberg (2019)
caution that many debiasing methods only conceal bias, which continues to
be present in the embeddings, and that many metrics used in the debiasing
research have only positive predictive ability (i.e. they can detect the presence
of bias but not its absence). On the other hand, studies such as Hirasawa and
Komachi (2019) show that debiasing improves multimodal machine translation,
thereby underlining the promising future of this research field. In our
study, we do not aim to debias embeddings but only compare different embedding
approaches in Slovene and Croatian concerning their gender bias.
3 DATA
In this section, we first present the lists of occupations in Slovene and Croatian
we used to analyse gender biases, followed by the embedding models.
3.1 List of occupations
We first describe the list of occupations we collected for Slovene, followed by 
its equivalent in Croatian. Our selection of occupations in Slovene is based
on the Standard Classification of Occupations (Vlada RS, 1997), which in turn is
based on the International Standard Classification of Occupations. Most occupations in
this classification are multi-word expressions (e.g. upravljalec/upravljalka 
metalurškega žerjava [en. metallurgical crane operator]), which are less 
suitable for computation with embeddings due to their specificity and length. 
To calculate analogies, we limit our approach to single-word occupations. The 
complete list of single-word occupations in Slovene includes 422 male/female 
occupation pairs, further reduced in line with the following criteria:
1. An occupation has to exist both in female and male grammatical gender
(gender-neutral words such as pismonoša [en. postman] are not
included in the list).
2. An occupation as a common noun occurs at least 500 times in the Corpus
of Written Standard Slovene Gigafida 2.0 (2020).
3. When a more established version of the occupation exists, we manually
add a synonym with the same root (e.g. in the case of fotografka,
an arguably more established fotografinja was added [en. photographer]).
When calculating analogies, the form more frequent in the corpora
is inserted at the input, but all synonyms (if they appear among
the results) are considered a correctly solved analogy.
4. If the standard classification does not include the female (e.g. dramatik
[en. playwright]) or male variant (e.g. prostitutka [en. prostitute])
of the occupation, the missing version is manually added if it exists and
appears in the Gigafida corpus (e.g. there are no established words for
female and male versions of postrešček [en. porter] and hostesa [en.
hostess], respectively).
5. Occupations where either the female or the male occupation variant
is a homograph (e.g. detektivka [en. detective] also denotes a detective
novel) or where an occupation could be associated with a context
unrelated to occupations (e.g. čarovnik/čarovnica [en. wizard/witch]),
were excluded from the final set of occupations. Likewise, we
filtered out occupations that are also proper names, such as kovač [en.
blacksmith]; for differentiating between common nouns and proper
names Sloleks 2.0 (Dobrovoljc et al., 2019) was used.
The final list contains 234 occupation pairs and is freely accessible in the CLARIN
repository.²
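The automatic part of the criteria above (both forms exist, a minimum corpus frequency, no homographs, no proper names) could be sketched as a simple filter. This is a hypothetical illustration, not the procedure actually used: the function `keep_pair`, the data structures and all frequency counts are invented, and applying the 500-occurrence threshold to both forms of a pair is an assumption, since the criterion does not spell this out. The manual steps (adding synonyms and missing forms) are not modelled.

```python
MIN_FREQ = 500  # frequency threshold named in criterion 2

def keep_pair(masc, fem, freq, homographs, proper_names):
    """Return True if a (masculine, feminine) pair passes the automatic filters."""
    if not masc or not fem:
        return False  # criterion 1: both grammatical genders must exist
    # Assumption: the frequency threshold is applied to both forms.
    if freq.get(masc, 0) < MIN_FREQ or freq.get(fem, 0) < MIN_FREQ:
        return False
    if masc in homographs or fem in homographs:
        return False  # criterion 5: homographs are excluded
    if masc in proper_names or fem in proper_names:
        return False  # criterion 5: forms that are also proper names are excluded
    return True

# Invented corpus frequencies, for illustration only.
freq = {"učitelj": 90000, "učiteljica": 60000, "fotograf": 40000,
        "fotografinja": 3000, "detektiv": 20000, "detektivka": 2500,
        "pismonoša": 8000}
homographs = {"detektivka"}
proper_names = {"kovač"}

candidates = [
    ("učitelj", "učiteljica"),
    ("fotograf", "fotografinja"),
    ("detektiv", "detektivka"),  # dropped: homograph
    ("pismonoša", None),         # dropped: no separate feminine form
]

kept = [(m, f) for m, f in candidates
        if keep_pair(m, f, freq, homographs, proper_names)]
print(kept)
```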
For Croatian, we compiled a list of occupations from two existing sources. The
first source contains occupations from the word analogy dataset by Svoboda
and Beliga (2018). It consists of 109 pairs of single-word occupations. The second
source is ESCO (European Skills, Competences, Qualifications and Occupations)³
and lists 2942 occupations in male and female form. Similar to the Slovene
list of occupations, most of the classifications from ESCO are multi-word
expressions, e.g. špediterski službenik / špediterska službenica za uvoz i izvoz
riba, rakova i mekušaca [en. import-export specialist in fish, crustaceans and
molluscs]. After removing all multi-word occupations, the ESCO source contains
309 pairs of single-word occupations. The final, combined list from both
sources, filtered to remove duplicates, contains 375 occupation pairs.
3.2 Word embedding models
Different configurations of word embeddings for Slovenian and Croatian were 
used in the experimental phase. We first list the Slovene embedding models 
followed by the Croatian ones.
2 http://hdl.handle.net/11356/1347 
3 https://ec.europa.eu/esco/portal 
3.2.1 Slovene word embedding models
We analyse two non-contextual embedding models, fastText and word2vec, 
and the ELMo contextual model.
• fastText (Bojanowski et al., 2017):
– 100-dimensional vectors, trained on Gigafida 2.0 in the EU EMBEDDIA⁴ project,
– 300-dimensional vectors, trained as above,
– 100-dimensional word vectors from the Sketch Engine portal
(word),
– 100-dimensional word vectors from the Sketch Engine portal,
where vectors are embeddings of word lemmas,
– 100-dimensional CLARIN.SI-embed.sl vectors (Ljubešić and Erjavec,
2018), and
– 300-dimensional vectors from the fastText.cc portal;
• word2vec (Mikolov et al., 2013a): 256-dimensional vectors, trained
for the needs of the Kontekst.io portal (Plahuta, 2020); available on
request⁵;
• ELMo (Peters et al., 2018): 1024-dimensional vectors, contextual embeddings
built in the EU EMBEDDIA project, trained on Gigafida (Ulčar, 2019).
Contextual embeddings produce a different vector for each
occurrence of the word based on its context. We computed word vectors
from sentences in Slovene Wikipedia. To get a single representation
for each word, comparable to other embeddings, for each of the
200,000 most common words, we calculated the centroid vector of all
word occurrences. Several different types of vectors were used:
– vectors from the output of the first (CNN) layer of the network that
is context-independent (i.e. layer 0),
– vectors from the output of the second (first LSTM) layer of the
network that is context-dependent (i.e. layer 1),
– vectors from the output of the third (second LSTM) layer of the
network that is context-dependent (i.e. layer 2).
4 http://embeddia.eu/
5 https://kontekst.io/kontakt
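The step of turning contextual ELMo vectors into one vector per word can be sketched as follows. This is a minimal illustration of a centroid computation under stated assumptions: the function name `centroid_vectors`, the input format (one `(word, vector)` pair per token occurrence) and the toy 2-dimensional vectors are hypothetical, and the code does not call any ELMo implementation.

```python
def centroid_vectors(occurrences, vocabulary):
    """Average all occurrence vectors of each word in `vocabulary`.

    occurrences: iterable of (word, vector) pairs, one per token occurrence.
    vocabulary:  set of words to keep (e.g. the most common words).
    """
    sums, counts = {}, {}
    for word, vec in occurrences:
        if word not in vocabulary:
            continue  # words outside the chosen vocabulary are ignored
        if word not in sums:
            sums[word] = [0.0] * len(vec)
            counts[word] = 0
        sums[word] = [s + x for s, x in zip(sums[word], vec)]
        counts[word] += 1
    # The centroid is the element-wise mean of the occurrence vectors.
    return {w: [s / counts[w] for s in sums[w]] for w in sums}

# Toy occurrence vectors, invented for illustration only.
occurrences = [
    ("zdravnik", [1.0, 0.0]),
    ("zdravnik", [0.0, 1.0]),
    ("in", [0.5, 0.5]),
    ("xyz", [9.0, 9.0]),  # not in the vocabulary, so it is skipped
]
static = centroid_vectors(occurrences, {"zdravnik", "in"})
print(static)
```

The resulting dictionary plays the same role as a non-contextual embedding table, which is what makes the ELMo vectors comparable with fastText and word2vec in the analogy task.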
3.2.2 Croatian word embedding models
For the Croatian language, we analyse several non-contextual embedding 
models: 
• fastText (Bojanowski et al., 2017):
– 100-dimensional vectors, trained in the EU EMBEDDIA project,
– 300-dimensional vectors, trained as above,
– 100-dimensional CLARIN.SI-embed.hr vectors of words and lemmas
(Ljubešić, 2018),
– 300-dimensional vectors from the fastText.cc portal.
4 EVALUATION METHODOLOGY
To assess the gender bias for each of the embedding models and each occupation,
we calculated occupational analogies in four ways. However, the core
analogy computation is the same in all cases: for every occupation of a masculine
grammatical gender O_m, we search for a feminine noun equivalent O_f. The
following vector is calculated:
v(d) = v(O_m) - v(m) + v(f),
where v(m) is the male vector, and v(f) is the female vector. If there were no
gender biases, v(d) would be equal or very similar to v(O_f). For every vector
v(d), we find N closest word vectors according to the cosine similarity (we
use N = 1, 5, or 10). When searching for closest words, all words appearing in
the embeddings are considered, except for the words man, woman, the word
O_m, and the words containing non-alphabetic characters (numbers, hyphens,
punctuation etc.). If the word O_f is located among the N-closest words, we
consider the analogy correct; else it is marked as incorrect. We convert all
letters to lowercase: e.g. the words Zdravnik, zdravnik and ZDRAVNIK are
all converted to zdravnik and thus considered the same word. The process
is repeated for each female variant of an occupation O_f where we look for the
male equivalent O_m. Here, the vector v(d) is calculated as:
v(d) = v(O_f) - v(f) + v(m).
When looking for closest words, O_f is omitted from the set of words, just as O_m
was ignored before. The final result represents the proportion of correctly
determined cases. The metric is called precision at N (P@N). A higher N allows
for finding additional closest hits in the vector space.
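The analogy computation and the P@N metric described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the toy 2-dimensional embeddings (including the distractor words) are invented, all words are assumed to be lowercased already, and every input occupation is assumed to have a vector.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy_vector(v_in, v_src, v_tgt):
    """v(d) = v(O_in) - v(source gender) + v(target gender)."""
    return [x - s + t for x, s, t in zip(v_in, v_src, v_tgt)]

def nearest(emb, query, n, excluded):
    """N closest words by cosine similarity, skipping excluded and non-alphabetic words."""
    ranked = sorted(
        ((cosine(query, vec), word) for word, vec in emb.items()
         if word not in excluded and word.isalpha()),
        reverse=True,
    )
    return [word for _, word in ranked[:n]]

def precision_at_n(emb, pairs, src, tgt, n):
    """Share of (input, expected) pairs whose expected word is among the N closest."""
    hits = 0
    for o_in, o_out in pairs:
        d = analogy_vector(emb[o_in], emb[src], emb[tgt])
        if o_out in nearest(emb, d, n, {src, tgt, o_in}):
            hits += 1
    return hits / len(pairs)

# Toy embeddings, invented for illustration only.
emb = {
    "moški": [1.0, 0.2], "ženska": [0.2, 1.0],
    "učitelj": [2.0, 0.5], "učiteljica": [1.2, 1.3],
    "zdravnik": [3.0, 0.2], "zdravnica": [2.0, 1.6],
    "sestra": [2.0, 1.0],    # distractor that lies closer to v(d) for zdravnik
    "covid19": [2.2, 1.0],   # contains a digit, so it is never returned
}
pairs_m = [("učitelj", "učiteljica"), ("zdravnik", "zdravnica")]

p_at_1 = precision_at_n(emb, pairs_m, "moški", "ženska", 1)
p_at_3 = precision_at_n(emb, pairs_m, "moški", "ženska", 3)
print(p_at_1, p_at_3)
```

In this toy setting the analogy for učitelj is solved at N = 1, while zdravnica is only reached at N = 3, which shows why P@N grows with N. Swapping `src` and `tgt` and reversing the pairs gives the computation for feminine inputs.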
Two approaches were used to determine the baseline male vector v(m) and 
female vector v(f):
• The first approach defines m simply as the word man and f as woman
(in Slovene corresponding to moški and ženska and in Croatian to
muškarac and žena).
• In the second approach, similarly to Bolukbasi et al. (2016), the difference
v(f) - v(m) or v(m) - v(f) is defined as the average difference
of vectors of word pairs which refer specifically to a woman or man
(Table 1).
Table 1: Inherently male-female word pairs in Slovene (left) and Croatian (right)

| Slovene m | Slovene f | Croatian m | Croatian f |
|---|---|---|---|
| moški [man] | ženska [woman] | muškarac [man] | žena [woman] |
| gospod [sir] | gospa [madam] | gospodin [sir] | gospođa [madam] |
| fant [boy] | dekle [girl] | momak [boy] | djevojka [girl] |
| deček [boy] | deklica [girl] | dječak [boy] | djevojčica [girl] |
| brat [brother] | sestra [sister] | brat [brother] | sestra [sister] |
| oče [father] | mati [mother] | otac [father] | majka [mother] |
| sin [son] | hči [daughter] | sin [son] | kći [daughter] |
| dedek [grandfather] | babica [grandmother] | djed [grandfather] | baka [grandmother] |
| mož [husband] | žena [wife] | suprug [husband] | supruga [wife] |
| on [he] | ona [she] | on [he] | ona [she] |
| fant [boy] | punca [girl] | tata [dad] | mama [mum] |
| stric [uncle] | teta [aunt] | | |
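The second approach, averaging the differences over the word pairs in Table 1, can be sketched as below. The function name `average_difference` and the toy vectors are hypothetical, and silently skipping pairs that are missing from the embeddings is an assumption made here for the sketch, not something the text specifies.

```python
def average_difference(emb, pairs):
    """Mean of v(f) - v(m) over the (m, f) pairs that have vectors for both words."""
    diffs = [
        [f - m for m, f in zip(emb[m_word], emb[f_word])]
        for m_word, f_word in pairs
        if m_word in emb and f_word in emb  # assumption: uncovered pairs are skipped
    ]
    dim = len(diffs[0])
    return [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]

# Toy vectors, invented for illustration only.
emb = {
    "moški": [1.0, 0.0], "ženska": [0.0, 1.0],
    "brat": [2.0, 1.0], "sestra": [1.0, 3.0],
    "učitelj": [2.0, 0.5],
}
pairs = [("moški", "ženska"), ("brat", "sestra"), ("stric", "teta")]

avg_diff = average_difference(emb, pairs)  # replaces v(f) - v(m)
# v(d) for a masculine input: v(O_m) + (v(f) - v(m))
v_d = [x + g for x, g in zip(emb["učitelj"], avg_diff)]
print(avg_diff, v_d)
```

For feminine inputs the same averaged vector is subtracted instead of added, which corresponds to using v(m) - v(f).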
When searching for the N closest words, we also tested the influence of
lemmatisation: in this case, all words in word embeddings were lemmatised using the
LemmaGen⁶ tool. By doing so, the effect of different word forms stemming
from, e.g. conjugation and declension, was offset: for example, word forms
zdravnico and zdravnice are considered a single near word since they share
the same lemma zdravnica [doctor (F)].
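The effect of lemmatisation on the list of closest words can be illustrated with a small sketch. It assumes that inflected forms are merged at the level of the ranked neighbour list, which is one possible reading of the description; the function `lemmatised_top_n` is hypothetical, and a tiny hand-written dictionary stands in for the lemmatiser (the study used LemmaGen, whose interface is not reproduced here).

```python
def lemmatised_top_n(ranked_words, lemmatise, n):
    """Walk a similarity-ranked word list and keep the first n distinct lemmas."""
    lemmas = []
    for word in ranked_words:
        lemma = lemmatise(word)
        if lemma not in lemmas:
            lemmas.append(lemma)  # several word forms count as one near word
        if len(lemmas) == n:
            break
    return lemmas

# Hand-written stand-in for a lemmatiser, for illustration only.
toy_lemmas = {"zdravnico": "zdravnica", "zdravnice": "zdravnica", "sestre": "sestra"}

def lemmatise(word):
    return toy_lemmas.get(word, word)

# Hypothetical neighbour list, ordered from most to least similar.
ranked = ["zdravnico", "zdravnice", "sestre", "zdravnica", "učiteljica"]

top2 = lemmatised_top_n(ranked, lemmatise, 2)
top3 = lemmatised_top_n(ranked, lemmatise, 3)
print(top2, top3)
```

Without lemmatisation the two forms zdravnico and zdravnice would occupy two of the N slots; with it they collapse into one, leaving room for further candidates.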
5 RESULTS
We present the results showing biases in all embeddings described in Section
3. We use the P@N measure, where N equals 1, 5, or 10. Some of the occupations
from our list are not covered by all word embeddings, i.e. there is no
word vector for them. Any example where the searched-for word is not among
the top N closest words is counted as incorrect, even if the searched-for word
does not appear in the embeddings. In cases where the embeddings do not
cover the input occupation, and we cannot calculate the vector v(d), we dismiss
all such examples so that they do not affect the final result. The reader
interested in the results where non-covered examples are also considered is
referred to our conference paper (Supej et al., 2020).
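The treatment of uncovered words can be made explicit in a short sketch. The function `precision_with_coverage` and the stand-in retrieval function are hypothetical; the sketch only shows the counting rule: pairs whose input word has no vector are dismissed, while pairs whose expected word has no vector remain in the denominator and count as incorrect.

```python
def precision_with_coverage(pairs, vocabulary, top_n_words):
    """P@N where uncovered inputs are dismissed and uncovered targets count as misses.

    top_n_words: function mapping an input word to its N closest words.
    """
    hits = 0
    total = 0
    for o_in, o_out in pairs:
        if o_in not in vocabulary:
            continue  # v(d) cannot be computed: the example is dismissed
        total += 1
        # A target without a vector can never be retrieved, so it is a miss.
        if o_out in top_n_words(o_in):
            hits += 1
    return hits / total if total else None

# Hypothetical vocabulary and retrieval results, for illustration only.
vocabulary = {"učitelj", "učiteljica", "zdravnik", "zdravnica", "dramatik"}
fake_results = {
    "učitelj": ["učiteljica"],
    "zdravnik": ["sestra"],
    "dramatik": ["pisateljica"],
}

pairs = [
    ("učitelj", "učiteljica"),   # hit
    ("zdravnik", "zdravnica"),   # miss
    ("dramatik", "dramatičarka"),  # target not covered: counted as a miss
    ("gozdar", "gozdarka"),      # input not covered: dismissed
]
score = precision_with_coverage(pairs, vocabulary, lambda w: fake_results[w])
print(score)
```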
The results for Slovene analogies are presented in Table 2 and for the Croatian
analogies in Table 3. Results for experiments where we have a masculine
expression for the occupation O_m as the input, and we search for the equivalent
feminine expression of the same occupation O_f, are shown in the rightmost
columns (m input) for each language. Results where we have O_f as the input
and search for O_m are shown in the leftmost columns (f input) for each language.
As explained in Section 4, we tested different approaches. The approaches
where we lemmatised all the words or used the average difference of vectors
of pairs of words from Table 1 generally perform better (i.e. they express lower
gender bias). These two options have the suffixes lem and avg appended in the
tables, respectively. In this section, we only show the results for applying both
of these options (we do not apply lemmatisation to fastText (lemma) embeddings
as they are already lemmatised). Full results are presented in Appendix A
in Table 8 for Slovenian and in Table 9 for Croatian.
6 https://github.com/vpodpecan/lemmagen3/ 
Table 2: Results for all Slovenian embeddings

| Slovene word embeddings | Dimensions and approach | f input P@1 | f input P@5 | f input P@10 | m input P@1 | m input P@5 | m input P@10 |
|---|---|---|---|---|---|---|---|
| ELMo Embeddia | 1024D l0 lem avg | 0.907 | 0.933 | 0.947 | 0.370 | 0.398 | 0.403 |
| ELMo Embeddia | 1024D l1 lem avg | 0.907 | 0.947 | 0.947 | 0.381 | 0.392 | 0.398 |
| ELMo Embeddia | 1024D l2 lem avg | 0.880 | 0.933 | 0.933 | 0.376 | 0.398 | 0.398 |
| fastText.cc | 300D lem avg | 0.613 | 0.884 | 0.948 | 0.655 | 0.755 | 0.764 |
| fastText Embeddia | 100D lem avg | 0.906 | 0.971 | 0.976 | 0.677 | 0.720 | 0.724 |
| fastText Embeddia | 300D lem avg | **0.947** | **0.976** | **0.982** | 0.685 | 0.720 | 0.724 |
| fastText CLARIN.SI-embed.sl | 100D lem avg | 0.839 | 0.940 | 0.950 | **0.761** | **0.880** | **0.902** |
| fastText Sketch Engine (word) | 100D lem avg | 0.930 | 0.962 | 0.973 | 0.725 | 0.781 | 0.785 |
| fastText Sketch Engine (lemma) | 100D avg | 0.673 | 0.931 | 0.960 | 0.598 | 0.786 | 0.821 |
| word2vec Kontekst.io | 256D lem avg | 0.679 | 0.853 | 0.872 | 0.407 | 0.550 | 0.593 |

Note. Results for each approach are given both for a feminine occupation word on the input
(f input), where we search for the equivalent masculine term, and for a masculine occupation
word on the input (m input), where we search for the equivalent feminine term. The examples
where the embeddings do not cover the input occupation were dismissed. The best result in each
column is in bold.
Table 3: Results for all Croatian embeddings

| Croatian word embeddings | Dimensions and approach | f input P@1 | f input P@5 | f input P@10 | m input P@1 | m input P@5 | m input P@10 |
|---|---|---|---|---|---|---|---|
| fastText.cc | 300D lem avg | 0.731 | 0.939 | 0.954 | 0.546 | 0.637 | 0.644 |
| fastText Embeddia | 100D lem avg | 0.905 | 0.941 | 0.968 | 0.625 | 0.666 | 0.672 |
| fastText Embeddia | 300D lem avg | **0.923** | **0.982** | **0.986** | 0.631 | 0.675 | 0.678 |
| fastText CLARIN.SI-embed.hr (word) | 100D lem avg | 0.907 | 0.930 | 0.944 | **0.673** | **0.746** | **0.754** |
| fastText CLARIN.SI-embed.hr (lemma) | 100D avg | 0.244 | 0.678 | 0.826 | 0.266 | 0.521 | 0.588 |

Note. Results for each approach are given both for a feminine occupation word on the input
(f input), where we search for the equivalent masculine term, and for a masculine occupation
word on the input (m input), where we search for the equivalent feminine term. The examples
where the embeddings do not cover the input occupation were dismissed. The best result in each
column is in bold.
The results show that both lemmatisation of the words and using the average
of several inherently male or female words for male and female vectors
improve the reported scores. Applying both approaches gives the best results
in most cases. For finding the closest N words, we have also tried the CSLS
measure (Cross-Domain Similarity Local Scaling) (Conneau et al., 2018) instead
of the cosine similarity. This measure mitigates the problem of hubness in
the search for nearest neighbours: some words (called hubs in the nearest
neighbour graph representation) may be nearest neighbours of many other words,
while others are nearest neighbours of no other word (outliers). CSLS computes
nearest neighbours in both directions and largely avoids this problem. For the
experiments with O_f on the input and searching for O_m, there is no
significant difference in results between the cosine similarity and CSLS. For
the experiments with O_m on the input and searching for O_f, using CSLS gives
lower precision than the cosine similarity. This is especially the case where
we used the words “man” and “woman” for vectors v(m) and v(f). When using
averages of several inherently male and female words for vectors v(m) and v(f),
the difference in precision between the cosine similarity and CSLS is smaller,
but the cosine similarity still outperforms CSLS.
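To make the CSLS measure concrete, a minimal sketch in plain Python follows. It
assumes the formulation of Conneau et al. (2018), CSLS(x, y) = 2 cos(x, y) - r(x)
- r(y), where r is the mean cosine similarity of a vector to its K nearest
neighbours. Taking both neighbourhoods from the same monolingual vocabulary is
an assumption about how the measure transfers to the analogy setting; the
function names and toy vectors are hypothetical and not taken from the
experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_knn_similarity(vec, neighbours, k):
    """Mean cosine similarity of `vec` to its k most similar neighbours."""
    sims = sorted((cosine(vec, n) for n in neighbours), reverse=True)[:k]
    return sum(sims) / len(sims) if sims else 0.0


def csls(query, word, emb, k=10):
    """CSLS score of candidate `word` for a query vector (e.g. an analogy vector)."""
    # Neighbourhood of the query: every word in the vocabulary.
    r_query = mean_knn_similarity(query, list(emb.values()), k)
    # Neighbourhood of the candidate: every other word (assumption: the word itself is skipped).
    r_word = mean_knn_similarity(emb[word], [v for w, v in emb.items() if w != word], k)
    return 2 * cosine(query, emb[word]) - r_query - r_word


def rank_by_csls(query, emb, k=10):
    """Vocabulary words sorted from the highest to the lowest CSLS score."""
    return sorted(emb, key=lambda w: csls(query, w, emb, k), reverse=True)


# Toy two-dimensional "embeddings" (hypothetical values).
toy = {"a": (1.0, 0.0), "b": (0.0, 1.0)}
print(rank_by_csls((1.0, 0.0), toy, k=1))  # ['a', 'b']
```

A hub has a high mean similarity to its neighbours, so the subtracted r term
lowers its score relative to plain cosine similarity.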
We give a more detailed discussion of the results for each approach in the next 
section. We only present the results of the cosine similarity measure.
6 DISCUSSION
In the case of Slovene word embeddings, the fastText CLARIN.SI-embed.sl
embeddings reach the highest precision in the analogy task for male versions
of occupations at the input (Table 2). When there are female versions of
occupations at the input, the embedding model reaching the highest precision is
fastText Embeddia. Similar results are observed for Croatian embeddings
(Table 3). Lemmatisation of the output and averaging several inherently male
and female words for vectors v(m) and v(f) (instead of using only the
embeddings for woman or man) improve the precision in the analogy task for
different models and different input data. As described in Section 5, we
dismiss the examples where the embeddings do not cover the input occupation. If
we do not dismiss these examples but instead count them as incorrect, the share
of occupations covered by the embeddings has the largest effect on the score;
the results of this setup for Slovene can be found in our paper (Supej et al.,
2020). The fastText CLARIN.SI embeddings would then score best, as these
embeddings cover the occupations best. This is especially important for the
female occupations, since they have much lower coverage than male occupations.
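The difference between dismissing uncovered inputs and counting them as
incorrect can be sketched as follows. The sketch assumes that P@k is the share
of examples whose expected word appears among the first k analogy results; the
function name, the data layout and the example words are hypothetical.

```python
def precision_at_k(examples, k, count_uncovered_as_incorrect=False):
    """Share of examples whose expected word is among the top-k analogy results.

    `examples` is a list of (expected_word, ranked_results) pairs; ranked_results
    is None when the embeddings do not cover the input occupation.
    """
    hits = 0
    evaluated = 0
    for expected, ranked in examples:
        if ranked is None:
            if count_uncovered_as_incorrect:
                evaluated += 1  # the uncovered input is counted as a miss
            continue  # otherwise the example is dismissed
        evaluated += 1
        if expected in ranked[:k]:
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Hypothetical examples: two covered inputs and one uncovered input.
examples = [
    ("krojač", ["krojač", "šivilja"]),
    ("mizar", ["tesar", "mizar"]),
    ("rudar", None),
]
print(precision_at_k(examples, k=1))  # 0.5
print(round(precision_at_k(examples, k=1, count_uncovered_as_incorrect=True), 3))  # 0.333
```

With uncovered inputs counted as misses, a model with low coverage is penalised
even when its covered analogies are all correct.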
Results in Table 2 and Table 3 have been filtered, so that the words man, 
woman and the occupation on the input are removed from the list of analogy 
results, as explained in Section 4. With unfiltered results, the input occupation 
is often the result of the analogy task (Table 4). For more detailed results (not 
only with lemmatisation and using several inherently male and female words 
for v(m) and v(f)) see Table 10 in Appendix A.
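The filtering step can be illustrated with a small sketch of the analogy
calculation. It assumes the standard vector-offset formulation, in which the
query is v(input) - v(source gender) + v(target gender), with each gender vector
averaged over several gendered words; excluding all gender words used for the
averages is likewise an assumption. The two-dimensional vectors and the
function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def analogy(emb, input_word, source_gender, target_gender, n=10, filter_inputs=True):
    """Return the n words closest to v(input) - v(source) + v(target)."""
    v_source = mean_vector([emb[w] for w in source_gender])
    v_target = mean_vector([emb[w] for w in target_gender])
    query = tuple(o - s + t for o, s, t in zip(emb[input_word], v_source, v_target))
    # Filtering removes the gender words and the input occupation from the results.
    excluded = {input_word, *source_gender, *target_gender} if filter_inputs else set()
    candidates = [w for w in emb if w not in excluded]
    return sorted(candidates, key=lambda w: cosine(query, emb[w]), reverse=True)[:n]


# Toy embeddings: first dimension ~ gender, second ~ occupation (hypothetical values).
emb = {
    "moški": (1.0, 0.0), "oče": (1.0, 0.2),
    "ženska": (-1.0, 0.0), "mati": (-1.0, 0.2),
    "krojač": (1.0, 1.0), "krojačica": (-1.0, 1.0),
    "mizar": (1.0, 2.0),
}
print(analogy(emb, "krojačica", ["ženska", "mati"], ["moški", "oče"]))  # ['krojač', 'mizar']
```

Setting filter_inputs=False keeps the input occupation and the gender words
among the candidates, which corresponds to the unfiltered setting summarised in
Table 4.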
With the fastText Embeddia model, we reach similar results using 100- and
300-dimensional vectors (see Table 2 and Table 3). Other embeddings are
not directly comparable with regard to dimensionality, as they were trained
on different resources; the corpora used to train the embeddings play a
more important role than the number of dimensions. The fastText Embeddia
model in Table 4 nevertheless shows that dimensionality plays a role in
determining how often the input occupation is the result of the analogy: the
share rises from 0.143 (100D) to 0.419 (300D) for Slovene and from 0.094 to
0.352 for Croatian. In a different setup, in which the occupations that are not
covered by the embeddings are also taken into account, dimensionality strongly
influences the results (Supej et al., 2020).
Table 4: Share of cases where the result of the analogy with the highest cosine similarity is the
input occupation itself, before the filtering that produces the results in Table 2 and Table 3
(both male to female and female to male analogies)
Slovene word embeddings | Dimensions and approach | Share of outputs equal to inputs | Croatian word embeddings | Dimensions and approach | Share of outputs equal to inputs
ELMo Embeddia | 1024D l0 lem avg | 0.547 | | |
ELMo Embeddia | 1024D l1 lem avg | 0.423 | | |
ELMo Embeddia | 1024D l2 lem avg | 0.064 | | |
fT fastText.cc | 300D lem avg | 0.831 | fT fastText.cc | 300D lem avg | 0.672
fT Embeddia | 100D lem avg | 0.143 | fT Embeddia | 100D lem avg | 0.094
fT Embeddia | 300D lem avg | 0.419 | fT Embeddia | 300D lem avg | 0.352
fT CLARIN.SI-embed.sl (word) | 100D lem avg | 0.316 | fT CLARIN.SI-embed.hr (word) | 100D lem avg | 0.103
fT Sketch Engine (word) | 100D lem avg | 0.096 | | |
fT Sketch Engine (lemma) | 100D avg | 0.803 | fT CLARIN.SI-embed.hr (lemma) | 100D avg | 0.837
w2v Kontekst.io | 256D lem avg | 0.483 | | |
Note. The number of all cases is 468 (from 234 occupation pairs) for Slovene and 750 (from 375
occupation pairs) for Croatian.
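The share reported in Table 4 can be computed along the following lines. This
is only a sketch: it assumes that the denominator is the number of all analogy
cases passed in, and the function name and example results are hypothetical.

```python
def share_output_equals_input(cases):
    """Share of analogies whose top unfiltered result is the input word itself.

    `cases` is a list of (input_word, unfiltered_ranked_results) pairs.
    """
    same = sum(1 for word, ranked in cases if ranked and ranked[0] == word)
    return same / len(cases)


# Hypothetical unfiltered results for four analogies.
cases = [
    ("krojačica", ["krojačica", "krojač"]),
    ("šivilja", ["krojač", "šivilja"]),
    ("rudar", ["rudar", "bolničarka"]),
    ("mizar", ["mizarka", "mizar"]),
]
print(share_output_equals_input(cases))  # 0.5
```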
The coverage of masculine occupations is higher than that of feminine
occupations in all word embedding models (Table 5). The fastText
CLARIN.SI-embed.sl word embeddings achieve the highest coverage of female
occupations, while the ELMo word embeddings contain only 75 of the 234 female
occupations. As explained in Section 3.2.1, ELMo embeddings are limited to the
200,000 most common words in Wikipedia; therefore, we have significantly lower
coverage of occupations for ELMo. For comparison, other word embedding models
cover around 1 million words. Masculine occupations that do not appear in the
embeddings are typically occupations associated with women (e.g. male variants
of seamstress and cosmetician, in Slovene šiviljec and kozmetik, respectively).
Likewise, feminine occupations not present in the embeddings are traditionally
male occupations (e.g. female variants of auto mechanic and carpenter, in
Slovene avtomehaničarka and tesarka, respectively) or occupations that have
been culturally taken up exclusively by men (e.g. nadškof [en. archbishop]).
Poor representation of female occupations can also be attributed to other
factors: Zhao et al. (2018) report that mentions referring to men are more
likely to contain a job title than mentions referring to women.
Table 5: Coverage of male (m) and female (f) occupations from the list in different embeddings
as a ratio between covered occupations and all occupations
Slovene embeddings | m | f | Croatian embeddings | m | f
ELMo | 0.774 | 0.321 | | |
fastText cc | 0.979 | 0.739 | fastText cc | 0.848 | 0.527
fastText Embeddia | 0.991 | 0.726 | fastText Embeddia | 0.856 | 0.594
fastText CLARIN.SI-embed.sl | 1.000 | 0.932 | fastText CLARIN.SI-embed.hr (word) | 0.914 | 0.722
fastText Sketch Engine (word) | 0.996 | 0.791 | fastText CLARIN.SI-embed.hr (lemma) | 0.955 | 0.722
fastText Sketch Engine (lemma) | 1.000 | 0.863 | | |
word2vec Kontekst.io | 0.987 | 0.667 | | |
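The coverage ratio in Table 5 is the number of listed occupations found in an
embedding vocabulary divided by the number of all listed occupations. A minimal
sketch, with a hypothetical function name, occupation list and vocabulary:

```python
def coverage(occupations, vocabulary):
    """Ratio between the occupations found in the vocabulary and all occupations."""
    return sum(1 for o in occupations if o in vocabulary) / len(occupations)


# Hypothetical (male, female) occupation pairs and embedding vocabulary.
pairs = [
    ("krojač", "krojačica"),
    ("avtomehanik", "avtomehaničarka"),
    ("šiviljec", "šivilja"),
    ("tesar", "tesarka"),
]
vocabulary = {"krojač", "krojačica", "avtomehanik", "šivilja", "tesar"}

print(coverage([m for m, _ in pairs], vocabulary))  # 0.75
print(coverage([f for _, f in pairs], vocabulary))  # 0.5
# The ELMo figure for female occupations follows from 75 covered out of 234.
print(round(75 / 234, 3))  # 0.321
```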
Nissim et al. (2019) claim that most studies exaggerate the biases revealed
by analogy tasks. The design of these studies excludes the input occupation
from the possible results, even if the calculations would give this exact
occupation the highest cosine similarity, so that it would otherwise appear in
the results. This criticism is more relevant for English studies, as in Slovene
the gender in
occupations is for the most part expressed by word morphology. Even though
we omitted the input occupations from the results, which is standard practice
when calculating analogies, we also analysed the results before this filtering.
The analysis showed that the input occupation is indeed often the result with
the highest cosine similarity (Table 4), with the share varying significantly
between different models.
When manually comparing the results of different models from Tables 2 and
3, we also notice several differences between the models. In the case of the
ELMo and word2vec models, the outputs are largely occupations. The results of
the analogy task in the case of fastText Embeddia, CLARIN.SI-embed.sl and
Sketch Engine (word) are occupations, as well as words related to the
occupation on the input or words that share the same root as the input
occupation. The results of the fastText.cc and Sketch Engine (lemma) models are
typically words sharing the root with the input occupation.
Analogy results are interesting from a semantic point of view. Consider the
first results of two analogies in the Slovene “fastText Embeddia 100D lem avg”
model: ženska:krojačica :: moški:x gives x=krojač [en. woman:tailor_F ::
man:tailor_M], and ženska:šivilja :: moški:x also gives x=krojač [en.
woman:seamstress :: man:tailor]. While the word embedding of šiviljec [en.
seamster] is not available, that of krojač [en. tailor], a semantically linked
word from another morphological word family, is. Another interesting case is
the analogy ženska:manekenka :: moški:x, where x=nogometaš [en. woman:model ::
man:footballer] (Croatian “fastText Embeddia 100D lem avg”). While model and
footballer are not the same profession, this result indicates that female
models and male footballers appear in similar textual contexts. It would be
interesting to investigate those contexts further (e.g. both occupations may
represent desirable identities, such as being beautiful, rich, famous,
successful).
There are indeed more examples where the results of certain analogies
(especially in the case of the “word2vec Kontekst.io lem avg” model) are not
linked to the input occupation or are stereotypical. For example, the results
of the analogy moški:rudar :: ženska:x [en. man:miner :: woman:x] in the
aforementioned w2v model include barbika [en. barbie], klovnesa [en. clown_F],
čarovnica [en. witch], lutka [en. doll], prostitutka [en. prostitute_F],
akrobatka [en. acrobat_F], najstnica [en.
teenager_F], opica [en. monkey], princeska [en. princess], striptizeta [en.
stripper_F]. The case of stereotypical analogies in the w2v model is pointed
out by Supej et al. (2019).
As part of the analysis, a frequency list of analogy results for female and male 
input occupations was compiled for each word embedding model (only the 
lem avg configuration of the models was taken into account) (see Table 6 for 
Slovene and Table 7 for Croatian).
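Such a frequency list can be compiled by counting, for each word, the number of
analogies in which it appears among the top 10 results. The following is a
minimal sketch with hypothetical names and example data; counting each word at
most once per analogy is an assumption.

```python
from collections import Counter


def frequency_list(ranked_results, top_k=10, top_n=15):
    """Most common words among the top-k results of a set of analogies.

    `ranked_results` holds one ranked result list per input occupation.
    """
    counts = Counter()
    for ranked in ranked_results:
        counts.update(set(ranked[:top_k]))  # each word counted once per analogy
    return counts.most_common(top_n)


# Hypothetical results for three male input occupations.
results_m_input = [
    ["bolničarka", "šivilja"],
    ["bolničarka", "kuharica"],
    ["šivilja", "bolničarka"],
]
print(frequency_list(results_m_input))
# [('bolničarka', 3), ('šivilja', 2), ('kuharica', 1)]
```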
The most frequently occurring words mostly follow the expected pattern: for a
male occupation on the input, a female occupation appears on the output, and
vice versa. The presented Slovene embedding models follow this pattern. In the
case of the Croatian embeddings, several of the frequently occurring words do
not follow it: in the “fastText cc lem avg” model with a female occupation on
the input, several female occupation variants also occur frequently on the
output, e.g. etičarka [en. ethicist_F] and biologinja [en. biologist_F]. For
etičarka, it is possible that this result is influenced by other similar words
(e.g. kozmetičarka), as fastText models consider subword information. The most
frequently occurring words are primarily occupations, but not always: for
example, Škotkinja [en. female Scottish national] and otac [en. father]
frequently appear in the Croatian “fastText cc lem avg” model, while one of the
frequent words in the Slovene “word2vec Kontekst.io lem avg” model is korenjak
(denoting a brave man).
In Slovene word embeddings, we notice that the most frequently occurring
feminine occupations/words appear more often than the most frequently occurring
male occupations in the “ELMo l2 lem avg” and “w2v Kontekst.io lem avg” models.
A similar pattern is observed for the Croatian models presented in Table 7;
however, the most frequently occurring words appear less often than in the
Slovene embeddings. One possible explanation is that the models mentioned above
contain fewer word embeddings than some other models (200,000 and approximately
600,000, respectively). Both models exhibit a lower representation of the
female versions of occupations in the embeddings. The female occupations that
nevertheless appear in the embeddings therefore reappear more often. There are
overall more male occupations in the embeddings, possibly causing individual
male occupations to come up less frequently than female ones.
Table 6: Most common words that appear among the top 10 results of the analogy task (that is, among the 10 closest words to the searched-for
term, based on the cosine similarity measure) for selected Slovene embedding models
ELMo Embeddia l2 lem avg, m input | n | ELMo Embeddia l2 lem avg, f input | n | fastText CLARIN.SI lem avg, m input | n | fastText CLARIN.SI lem avg, f input | n | word2vec Kontekst.io lem avg, m input | n | word2vec Kontekst.io lem avg, f input | n
bolničarka [nurse] | 47 | geograf [geographer_M] | 9 | šivilja [seamstress] | 15 | mizar [carpenter_M] | 11 | kuharica [cook_F] | 44 | ortoped [orthopedist_M] | 14
biokemičarka [biochemist_F] | 39 | politolog [political scientist_M] | 8 | ključavničarka [locksmith_F] | 11 | biolog [biologist_M] | 10 | gospodinja [homemaker_F] | 38 | pisatelj [writer_M] | 14
frizerka [hairdresser_F] | 39 | biolog [biologist_M] | 7 | inštalaterka [installer_F] | 9 | ključavničar [locksmith_M] | 9 | šivilja [seamstress] | 33 | kardiolog [cardiologist_M] | 13
trgovka [salesperson_F] | 39 | dramaturg [playwright_M] | 7 | keramičarka [ceramist_F] | 9 | zgodovinar [historian_M] | 9 | frizerka [hairdresser_F] | 32 | nevrolog [neurologist_M] | 13
čistilka [cleaner_F] | 34 | književnik [writer_M] | 7 | filologinja [philologist_F] | 8 | internist [internist_M] | 8 | kozmetičarka [cosmetician_F] | 30 | urolog [urologist_M] | 13
znanstvenica [scientist_F] | 34 | scenarist [screenwriter_M] | 7 | oftalmologinja [ophthalmologist_F] | 8 | režiser [director_M] | 8 | čistilka [cleaner_F] | 29 | psihiater [psychiatrist_M] | 12
kuharica [cook_F] | 33 | animator [animator_M] | 6 | filozofinja [philosopher_F] | 7 | arheolog [archeologist_M] | 7 | fotografinja [photographer_F] | 29 | ekolog [ecologist_M] | 11
geologinja [geologist_F] | 30 | esejist [essayist_M] | 6 | geofizičarka [geophysicist_F] | 7 | natakar [waiter_M] | 7 | zdravnica [doctor_F] | 29 | hišnik [janitor_M] | 11
perica [laundress] | 28 | etnolog [ethnologist_M] | 6 | kmetica [farmer_F] | 7 | pisatelj [writer_M] | 7 | služkinja [maid] | 26 | biolog [biologist_M] | 10
služkinja [maid] | 28 | fotograf [photographer_M] | 6 | nevrokirurginja [neurosurgeon_F] | 7 | primarij [senior doctor_M] | 7 | trgovka [salesperson_F] | 26 | korenjak [brave man] | 10
biologinja [biologist_F] | 27 | ilustrator [illustrator_M] | 6 | strugarka [worker using a planer machine_F] | 7 | stomatolog [stomatologist_M] | 7 | slikarka [painter_F] | 25 | maneken [model_M] | 10
gospodinja [homemaker_F] | 26 | lutkar [puppeteer_M] | 6 | geologinja [geologist_F] | 6 | tesar [carpenter_M] | 7 | tajnica [secretary_F] | 25 | režiser [director_M] | 10
matematičarka [mathematician_F] | 26 | paleontolog [paleontologist_M] | 6 | hematologinja [hematologist_F] | 6 | fotoreporter [photojournalist_M] | 6 | veterinarka [veterinarian_F] | 25 | akademik [academic_M] | 9
mikrobiologinja [microbiologist_F] | 26 | pravnik [jurist_M] | 6 | kardiologinja [cardiologist_F] | 6 | gostilničar [innkeeper_M] | 6 | znanstvenica [scientist_F] | 25 | akademski slikar [academic painter_M] | 9
arheologinja [archeologist_F] | 25 | režiser [director_M] | 6 | paleontologinja [paleontologist_F] | 6 | kardiolog [cardiologist_M] | 6 | socialna delavka [social worker_F] | 24 | glasbenik [musician_M] | 9
Table 7: 15 most common words that appear among the top 10 results of the analogy task (that is, among the 10 closest words to the searched-for
term, based on the cosine similarity measure) for selected Croatian embedding models
ELMo Embeddia l2 lem avg, m input | n | ELMo Embeddia l2 lem avg, f input | n | fastText cc lem avg, m input | n | fastText cc lem avg, f input | n | fastText CLARIN.SI-embed.hr (word) lem avg, m input | n | fastText CLARIN.SI-embed.hr (word) lem avg, f input | n
krojačica [tailor_F] | 34 | povjesničar [historian_M] | 10 | kemičarka [chemist_F] | 12 | etičarka [ethicist_F] | 8 | krojačica [tailor_F] | 31 | znanstvenik [scientist_M] | 16
automehaničarka [auto mechanic_F] | 29 | konobar [waiter_M] | 10 | vještakinja [expert_F] | 11 | otfamologinja [ophthalmologist_F] | 7 | automehaničarka [auto mechanic_F] | 23 | biolog [biologist_M] | 16
zavarivačica [welder_F] | 20 | biolog [biologist_M] | 9 | fizičarka [physicist_F] | 10 | redatelj [director_M] | 6 | zavarivačica [welder_F] | 22 | profesor [professor_M] | 9
keramičarka [ceramist_F] | 16 | umjetnik [artist_M] | 8 | biokemičarka [biochemist_F] | 10 | glumac [actor_M] | 6 | šivačica [seamstress] | 18 | povjesničar [historian_M] | 9
kemičarka [chemist_F] | 15 | sociolog [sociologist_M] | 8 | vozačica [driver_F] | 9 | biologinja [biologist_F] | 6 | keramičarka [ceramist_F] | 18 | konobar [waiter_M] | 9
biokemičarka [biochemist_F] | 15 | fizioterapeut [physiotherapist_M] | 8 | pravnica [jurist_F] | 9 | paleografkinja [paleographer_F] | 5 | soboslikarica [painter-decorator_F] | 17 | genetičar [geneticist_M] | 9
šivačica [seamstress] | 14 | redatelj [director_M] | 7 | frizerka [hairdresser_F] | 9 | ihtiologinja [ichthyologist_F] | 5 | biokemičarka [biochemist_F] | 16 | redatelj [director_M] | 8
spremačica [maid] | 14 | poslovođa [manager_F/M] | 7 | masažerka [massage therapist_F] | 8 | suscenarist [co-screenwriter_M] | 4 | kemičarka [chemist_F] | 15 | poslovođa [manager_F/M] | 8
čistačica [cleaner_F] | 13 | paleontolog [paleontologist_M] | 7 | tehničarka [technician_F] | 7 | scenografkinja [scenographer_F] | 4 | genetičarka [geneticist_F] | 13 | policajac [police officer_M] | 8
genetičarka [geneticist_F] | 13 | književnik [writer_M] | 7 | političarka [politician_F] | 7 | otac [father] | 4 | cvjećarka [florist_F] | 12 | zaposlenik [employee_M] | 7
fizičarka [physicist_F] | 13 | geologinja [geologist_F] | 7 | matematičarka [mathematician_F] | 7 | književnik [writer_M] | 4 | biofizičarka [biophysicist_F] | 12 | umjetnik [artist_M] | 7
astrofizičarka [astrophysicist_F] | 13 | dramaturg [playwright_M] | 7 | lutkarica [puppeteer_F] | 7 | dopukovnik [lieutenant colonel_M] | 4 | znanstvenica [scientist_F] | 11 | sociolog [sociologist_M] | 7
šnajderica [seamstress] | 12 | znanstvenik [scientist_M] | 6 | glumica [actor_F] | 7 | daktilografkinja [typist_F] | 4 | geologinja [geologist_F] | 11 | snimatelj [cameraman] | 7
mehaničarka [mechanic_F] | 12 | zaštitar [security guard_M] | 6 | trgovkinja [salesperson_F] | 6 | astrobiologinja [astrobiologist_F] | 4 | tehničarka [technician_F] | 10 | satnik [captain_M] | 7
informatičarka [computer scientist_F] | 12 | sociologinja [sociologist_F] | 6 | terapeutkinja [therapist_F] | 6 | škotkinja [Scottish national_F] | 3 | mehaničarka [mechanic_F] | 10 | porter [doorkeeper_M] | 7
In the case of the Slovene “ELMo l2 lem avg” and “w2v Kontekst.io lem avg”
models, occupations of a lower social class (čistilka [en. cleaner_F], perica
[en. laundress], gospodinja [en. homemaker_F]), as well as archaic occupations
with women in inferior roles (služkinja [en. maid]), are observed among the
frequent analogy results of female grammatical gender. Socially inferior
occupations are rare among the most frequent male analogy results. Fewer
socially inferior occupations are observed among the Croatian results
(exceptions being, e.g., the female variants of cleaner and maid, čistačica and
spremačica, respectively, in the “ELMo Embeddia l2 lem avg” model).
We observed that certain words (especially female occupations) appear among
the results despite being semantically unrelated to the input occupation,
especially when a typically male occupation is on the input. For example, in
the Slovene model “fastText Embeddia 100D lem avg”, bolničarka [en. nurse_F] is
the first result of the analogy moški:rudar :: ženska:x [en. man:miner ::
woman:x], and šivilja [en. seamstress] is the first result of the analogy
moški:avtomehanik :: ženska:x [en. man:auto mechanic :: woman:x]. One
explanation is that certain word embeddings are more “central” than others and
are, therefore, the closest neighbours of many other words. To check this
explanation, we replaced the cosine similarity measure with the CSLS measure
(Conneau et al., 2018), which takes into account each word's N closest
neighbours. We observed that the precision is worse when using the CSLS measure
than when using the cosine similarity (Section 5), and therefore we do not
report these results. However, when observing the most common words returned as
the analogy task results (Table 6 and Table 7), the distribution of the most
common words is more uniform when using the CSLS measure.
Direct comparison of models between Croatian and Slovene is not possible, as
the embeddings are trained on different text corpora, and the professions used
for analogy calculations are not the same. However, we can notice that in
Croatian the occupational gender bias in the tested embeddings is slightly
higher. Interestingly, the statistical data show that the employment gap and
the pay gap between women and men are lower in Slovenia than in Croatia
(Eurostat, 2021). In future, it would be interesting to study if the female
employment rate and gap, as well as the gap in salaries for the same
professions between countries,
are correlated with the gender bias in embedding models trained on the
corresponding national languages, and how this correlation changes through time.
7 CONCLUSIONS AND FURTHER WORK
We evaluated different Slovene and Croatian word embeddings on analogies
of male and female occupations (using different configurations and approaches
to calculate analogies). Our focus is on the quantitative evaluation, and the
results may be informative for developers of NLP tools. The lowest gender bias
was obtained using the fastText embeddings. In finding female analogies (male
occupation on the input), the best performing models proved to be fastText
CLARIN.SI-embed.sl and fastText CLARIN.SI-embed.hr for Slovene and Croatian,
respectively, while the best performing models for finding male analogies
(female occupation on the input) were the respective fastText Embeddia models.
Using averages of several inherently male and female words instead of only the
embeddings for woman or man improved the results. Lemmatisation likewise
improved the precision. With female occupations on the input, the best results
(P@10) of 0.982 and 0.986 are achieved using the “fastText Embeddia 300D lem
avg” models for Slovene and Croatian, respectively (the examples where the
embeddings do not cover the input occupation were dismissed). With male
occupations on the input, the best results (P@10) of 0.902 and 0.754 are
produced by the “fastText CLARIN.SI-embed.sl 100D lem avg” and “fastText
CLARIN.SI-embed.hr (word) 100D lem avg” models, respectively (cases where the
input occupation is not present among the embeddings were likewise dismissed).
The lower results for male input reflect the lower coverage of female
occupation equivalents in the embedding models. The “fastText
CLARIN.SI-embed.sl” and “fastText CLARIN.SI-embed.hr (lemma)” models contain
the highest ratio of searched-for female and male occupations. The qualitative
analysis identifies the word2vec Kontekst.io model as the model with the
highest degree of gender bias in the results (stereotypically male/female
occupations appearing among the results regardless of the grammatical gender of
the input occupation).
In future work, we will focus on a detailed qualitative analysis and the
relationship between word embeddings, language, and social power. Moreover,
we will align occupations in Slovene and Croatian. Further work will also
encompass an evaluation of BERT contextual embeddings and experiments in
other languages. The impact of the gender bias will be tested in predictive
models on practical tasks such as sentiment analysis.
Acknowledgments
The research was supported by the Slovene Research Agency through research
core funding no. P6-0411 and P2-103, as well as project no. J6-2581. This
paper is supported by the European Union’s Horizon 2020 Programme project
EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European
News Media, grant no. 825153). The results of this paper reflect only the
authors’ view and the Commission is not responsible for any use that may be
made of the information it contains.
REFERENCES
Argamon, S., Koppel, M., Fine, J., & Shimoni, A. R. (2003). Gender, genre,
and writing style in formal written texts. TEXT, 23, 321–346.
Baker, P. (2010). Will Ms ever be as frequent as Mr? A corpus-based
comparison of gendered terms across four diachronic corpora of British English.
Gender & Language, 4(1), 125–149.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word
vectors with subword information. Transactions of the Association for
Computational Linguistics, 5, 135–146.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. (2016). Man
is to computer programmer as woman is to homemaker? Debiasing word
embeddings. Proceedings of the 30th Conference on Neural Information
Processing Systems (NIPS’16) (pp. 4356–4364).
Bordia, S., & Bowman, S. (2019). Identifying and Reducing Gender Bias in
Word-Level Language Models. Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics:
Student Research Workshop (pp. 7–15).
Brunet, M. E., Alkalay-Houlihan, C., Anderson, A., & Zemel, R. S. (2019).
Understanding the Origins of Bias in Word Embeddings. Proceedings of
International Conference on Machine Learning (ICML 2019).
Caldas-Coulthard, C. R., & Moon, R. (2010). ‘Curvy, hunky, kinky’: Using
corpora as tools for critical analysis. Discourse & Society, 21(2), 99–133.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived
automatically from language corpora necessarily contain human biases.
Science, 356(6334), 183–186.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jegou, H. (2018). Word
translation without parallel data. Proceedings of the International
Conference on Learning Representation (ICLR).
Dobrovoljc, K., Krek, S., Holozan, P., Erjavec, T., Romih, T., Arhar Holdt, Š.,
Čibej, J., Krsnik L., & Robnik-Šikonja, M. (2019). Morphological lexicon
Sloleks 2.0. CLARIN.SI. http://hdl.handle.net/11356/1230
Eurostat (2021). Gender statistics. Retrieved from
https://ec.europa.eu/eurostat/statistics-explained/index.php/Gender_statistics#Labour_market
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings
quantify 100 years of gender and ethnic stereotypes. PNAS, 115(16).
Garimella, A., Banea, C., Hovy, D., & Mihalcea, R. (2019). Women’s syntactic
resilience and men’s grammatical luck: Gender-bias in part-of-speech
tagging and dependency parsing. Proceedings of the 57th Annual Meeting of
the ACL (pp. 3493–3498).
Gigafida 2.0. Retrieved from https://viri.cjvt.si/gigafida
Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover
up systematic gender biases in word embeddings but do not remove them.
Proceedings of NAACL-HLT 2019 (pp. 609–614).
Gorjanc, V. (2007). Kontekstualizacija oseb ženskega in moškega spola v
slovenskih tiskanih medijih. In I. Novak-Popov (Ed.), Stereotipi v slovenskem
jeziku, literaturi in kulturi: zbornik predavanj 43. seminarja slovenskega
jezika, literature in kulture (pp. 173–180). Ljubljana: Center za
slovenščino kot drugi/tuji jezik.
Hill, B., & Shaw, A. (2013). The Wikipedia gender gap revisited:
Characterising survey response bias with propensity score estimation. PloS One, 8.
Hirasawa, T., & Komachi, M. (2019). Debiasing Word Embeddings Improves
Multimodal Machine Translation. Proceedings of Machine Translation
Summit XVII, Vol. 1 (pp. 32–42).
Hovy, D., & Søgaard, A. (2015). Tagging performance correlates with author
age. Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP
(pp. 483–488).
Hovy, D. (2015). Demographic factors improve classification performance. 
Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP 
(pp. 752–762).
Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., & 
Denuyl, S. (2020). Social Biases in NLP Models as Barriers for Persons 
with Disabilities. Proceedings of the 58th Annual Meeting of the Associa-
tion for Computational Linguistics (pp. 5491–5501).
Kern, B., & Dobrovoljc, H. (2017). Pisanje moških in ženskih oblik in 
uporaba podčrtaja za izražanje »spolne nebinarnosti«. Jezikov -
na svetovalnica. Retrieved from https://svetovalnica.zrc-sazu.si/topic/2247/
pisanje-mo%C5%A1kih-in-%C5%BEenskih-oblik-in-uporaba-pod%C4%8Drtaja-za-iz -
ra%C5%BEanje-spolne-nebinarnosti 
Kiritchenko, S., & Mohammad, S., (2018). Examining Gender and Race Bias 
in Two Hundred Sentiment Analysis Systems. Proceedings of the Sev-
enth Joint Conference on Lexical and Computational Semantics (pp. 
43–53).
Koolen, C., & van Cranenburgh, A. (2017). These are not the stereotypes you 
are looking for: Bias and fairness in authorial gender attribution. 
Proceedings of the First Ethics in NLP workshop (pp. 12–22).
Lakoff, R. (1973). Language and woman’s place. Language in Society, 2(1), 
45–80.
Liang, P. P., Li, I. M., Zheng, E., Lim, Y. C., Salakhutdinov, R., & Morency, L. 
(2020). Towards Debiasing Sentence Representations. Proceedings of the 
58th Annual Meeting of the Association for Computational Linguistics 
(pp. 5502–5515).
Ljubešić, N., & Erjavec, T. (2018). Word embeddings CLARIN.SI-embed.sl 
1.0. Slovenian language resource repository CLARIN.SI. 
http://hdl.handle.net/11356/1204 
Ljubešić, N. (2018). Word embeddings CLARIN.SI-embed.hr 1.0. Slovenian 
language resource repository CLARIN.SI. http://hdl.handle.net/11356/1205 
Martinc, M., Škrjanec, I., Zupan, K., & Pollak, S. (2017). PAN 2017: Author 
profiling - gender and language variety prediction: notebook for PAN at 
CLEF 2017. Proceedings of the Conference and Labs of the Evaluation 
Forum.
Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013a). Efficient estimation 
of word representations in vector space. Proceedings of the International 
Conference on Learning Representations (pp. 1–12).
Mikolov, T., Yih, W-t., & Zweig, G. (2013b). Linguistic regularities in 
continuous space word representations. Proceedings of the 2013 Conference of 
the North American Chapter of the ACL: Human Language Technologies 
(pp. 746–751).
Nozza, D., Volpetti, C., & Fersini, E. (2019). Unintended Bias in Misogyny 
Detection. Proceedings of IEEE/WIC/ACM International Conference on 
Web Intelligence (pp. 149–155).
Nissim, M., van Noord, R., & van der Goot, R. (2019). Fair is better than 
sensational: Man is to doctor as woman is to doctor. Computational 
Linguistics, 46(3), 487–497.
Pearce, M. (2008). Investigating the collocational behaviour of man and 
woman in the BNC using Sketch Engine. Corpora, 3(1), 1–29.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & 
Zettlemoyer, L. (2018). Deep contextualised word representations. 
Proceedings of NAACL-HLT 2018 (pp. 2227–2237).
Plahuta, M. (2020). O slovarju. Retrieved from https://kontekst.io/oslovarju 
Popič, D., & Gorjanc, V. (2018). Challenges of adopting gender-inclusive 
language in Slovene. Suvremena lingvistika, 44(86), 329–350.
Prates, M. O. R., Avelar, P. H., & Lamb, L. C. (2020). Assessing gender bias in 
machine translation: A case study with Google Translate. Neural Computing 
and Applications, 32, 6363–6381.
Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., & Daelemans, W. (2015). 
Overview of the 3rd author profiling task at PAN 2015. In L. Cappellato, 
N. Ferro, G. J. F. Jones & E. SanJuan (Eds.), CLEF 2015 Labs and Workshops, 
Notebook Papers.
Schick, T., Udupa, S., & Schütze, H. (2021). Self-Diagnosis and Self-Debiasing: 
A Proposal for Reducing Corpus-Based Bias in NLP. arXiv preprint 
arXiv:2103.00453.
Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., Mirza, D., 
Belding, E., Chang, K-W., & Wang, W. Y. (2019). Mitigating gender bias in 
natural language processing: Literature review. Proceedings of the 57th 
Annual Meeting of the ACL (pp. 1630–1640).
Supej, A., Plahuta, M., Purver, M., Mathioudakis, M., & Pollak, S. (2019). 
Gender, language, and society: Word embeddings as a reflection of social 
inequalities in linguistic corpora. Proceedings of the Slovensko sociološko 
srečanje 2019 – Znanost in družbe prihodnosti (pp. 75–83).
Supej, A., Ulčar, M., Robnik-Šikonja, M., & Pollak, S. (2020). Primerjava 
slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. 
Proceedings of the Conference on Language Technologies & Digital 
Humanities 2020 (pp. 93–100).
Svoboda, L., & Beliga, S. (2018). Evaluation of Croatian Word Embeddings. 
Proceedings of the Eleventh International Conference on Language 
Resources and Evaluation (LREC 2018) (pp. 1512–1518).
Škrjanec, I., Lavrač, N., & Pollak, S. (2018). Napovedovanje spola 
slovenskih blogerk in blogerjev. In D. Fišer (Ed.), Viri, orodja in metode 
za analizo spletne slovenščine (pp. 356–373). Ljubljana: Znanstvena 
založba FF.
Tannen, D. (1990). You Just Don’t Understand: Women and Men in 
Conversation. New York: Ballantine Books.
Ulčar, M. (2019). ELMo embeddings model, Slovenian. Slovenian language 
resource repository CLARIN.SI. http://hdl.handle.net/11356/1257 
Vanmassenhove, E., Hardmeier, C., & Way, A. (2018). Getting gender 
right in neural machine translation. Proceedings of the EMNLP (pp. 
3003–3008).
Verhoeven, B., Škrjanec, I., & Pollak, S. (2017). Gender profiling for Slovene 
Twitter communication: The influence of gender marking, content and 
style. Proceedings of the 6th BSNLP Workshop (pp. 119–125).
Vlada RS (1997). 1641. uredba o uvedbi in uporabi standardne klasifikacije 
poklicev. Uradni list RS, 28, 2217. Retrieved from 
https://www.uradni-list.si/glasilo-uradni-listrs/vsebina?urlid=199728&stevilka=1641 
Volkova, S., Wilson, T., & Yarowsky, D. (2013). Exploring demographic 
language variations to improve multilingual sentiment analysis in social 
media. Proceedings of the EMNLP (pp. 1815–1827).
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K-W. (2017). Men also 
like shopping: Reducing gender bias amplification using corpus-level 
constraints. Proceedings of the EMNLP (pp. 2979–2989).
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K-W. (2018). Gender 
bias in coreference resolution: Evaluation and debiasing methods. 
Proceedings of the NAACL-HLT (pp. 15–20).
COMPARISON OF SLOVENE AND CROATIAN 
WORD EMBEDDINGS IN TERMS OF GENDER 
USING OCCUPATIONAL ANALOGIES
In recent years, the use of deep neural networks and dense vector embeddings 
for text representation has led to a series of excellent results in the 
computational understanding of natural language. It has also been shown that 
word embeddings often capture gender, racial and other biases. This article 
focuses on evaluating Slovene and Croatian word embeddings in terms of gender 
using word analogies. We compiled a list of masculine and feminine nouns for 
occupations in Slovene and evaluated the gender bias of the fastText, word2vec 
and ELMo embedding models with different configurations and approaches to 
computing analogies. The fastText embeddings turned out to contain the lowest 
occupational gender bias. For the Croatian evaluation, we likewise used lists 
of occupations and compared different fastText embeddings. 
Keywords: word embeddings, gender bias, word analogies, occupations, 
natural language processing
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 
International.
https://creativecommons.org/licenses/by-sa/4.0/
APPENDIX 1 
We present the results of the different approaches described in Section 4 
and Section 5. In the tables, the suffix lem marks the approach where we 
lemmatised all the words, and the suffix avg marks the approach where we used 
the average difference of the vectors of the word pairs from Table 1. The 
results for Slovene word embeddings are shown in Table 8 and the results for 
Croatian word embeddings in Table 9. Table 10 shows the share of cases where 
the result of the analogy task is the input occupation itself.
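As a rough illustration of the avg approach, the following minimal Python sketch averages the difference vectors of a few seed pairs and adds that offset to an input occupation before ranking the vocabulary by cosine similarity. The toy three-dimensional vectors, the seed pairs and the function names mean_offset and analogy are illustrative assumptions; they are not the evaluated embeddings, the word pairs from Table 1, or necessarily the exact procedure of Sections 4 and 5.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_offset(emb, seed_pairs):
    """Average (feminine - masculine) difference vector over the seed pairs."""
    diffs = [[f - m for f, m in zip(emb[fem], emb[masc])]
             for masc, fem in seed_pairs]
    return [sum(col) / len(diffs) for col in zip(*diffs)]

def analogy(emb, word, offset, sign=1, k=5):
    """Top-k words closest to emb[word] + sign * offset, excluding the input."""
    target = [x + sign * o for x, o in zip(emb[word], offset)]
    scored = [(cosine(target, vec), w) for w, vec in emb.items() if w != word]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]

# Hypothetical toy vectors: dimension 0 encodes gender, the rest encode topic.
emb = {
    "moški": [-1.0, 0.2, 0.1], "ženska": [1.0, 0.2, 0.1],
    "on": [-1.0, 0.0, 0.3], "ona": [1.0, 0.0, 0.3],
    "učitelj": [-1.0, 1.0, 0.0], "učiteljica": [1.0, 1.0, 0.0],
    "zdravnik": [-1.0, 0.0, 1.0], "zdravnica": [1.0, 0.0, 1.0],
}
# Placeholder (masculine, feminine) seed pairs, not the pairs from Table 1.
seed_pairs = [("moški", "ženska"), ("on", "ona")]
offset = mean_offset(emb, seed_pairs)

print(analogy(emb, "učitelj", offset, sign=1, k=1))     # ['učiteljica']
print(analogy(emb, "zdravnica", offset, sign=-1, k=1))  # ['zdravnik']
```

With sign=1 the offset moves a masculine input towards feminine candidates, and sign=-1 reverses the direction. The input word is excluded from the candidates in this sketch.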
Table 8: Results for Slovene embeddings
Slovene word embeddings   f input                m input
  dimensions and approach P@1    P@5    P@10     P@1    P@5    P@10
ELMo Embeddia
  1024D l0 avg            0.707  0.933  0.947    0.166  0.359  0.387
  1024D l0                0.427  0.920  0.947    0.210  0.376  0.398
  1024D l0 lem avg        0.907  0.933  0.947    0.370  0.398  0.403
  1024D l0 lem            0.893  0.947  0.947    0.376  0.392  0.403
  1024D l1 avg            0.907  0.947  0.947    0.381  0.392  0.398
  1024D l1                0.880  0.947  0.947    0.376  0.392  0.392
  1024D l1 lem avg        0.907  0.947  0.947    0.381  0.392  0.398
  1024D l1 lem            0.907  0.947  0.947    0.376  0.392  0.392
  1024D l2 avg            0.880  0.933  0.933    0.376  0.398  0.398
  1024D l2                0.853  0.920  0.933    0.370  0.398  0.398
  1024D l2 lem avg        0.880  0.933  0.933    0.376  0.398  0.398
  1024D l2 lem            0.853  0.920  0.933    0.370  0.398  0.398
fastText.cc
  300D avg                0.393  0.798  0.913    0.607  0.738  0.751
  300D                    0.150  0.561  0.792    0.445  0.703  0.734
  300D lem avg            0.613  0.884  0.948    0.655  0.755  0.764
  300D lem                0.457  0.861  0.919    0.498  0.725  0.751
fastText Embeddia
  100D avg                0.900  0.971  0.976    0.672  0.716  0.720
  100D                    0.471  0.871  0.906    0.638  0.716  0.720
  100D lem avg            0.906  0.971  0.976    0.677  0.720  0.724
  100D lem                0.735  0.924  0.941    0.638  0.716  0.720
  300D avg                0.835  0.971  0.976    0.668  0.716  0.724
  300D                    0.329  0.859  0.959    0.685  0.720  0.720
  300D lem avg            0.947  0.976  0.982    0.685  0.720  0.724
  300D lem                0.818  0.971  0.976    0.685  0.720  0.720
fastText CLARIN.SI-embed.sl
  100D avg                0.784  0.913  0.940    0.761  0.868  0.880
  100D                    0.083  0.587  0.780    0.705  0.855  0.885
  100D lem avg            0.839  0.940  0.950    0.761  0.880  0.902
  100D lem                0.651  0.881  0.917    0.709  0.859  0.885
fastText Sketch Engine (word)
  100D avg                0.886  0.962  0.973    0.717  0.768  0.777
  100D                    0.211  0.757  0.908    0.691  0.768  0.777
  100D lem avg            0.930  0.962  0.973    0.725  0.781  0.785
  100D lem                0.811  0.951  0.962    0.691  0.768  0.781
fastText Sketch Engine (lemma)
  100D avg                0.673  0.931  0.960    0.598  0.786  0.821
  100D                    0.510  0.812  0.891    0.380  0.658  0.756
word2vec Kontekst.io
  256D avg                0.679  0.853  0.872    0.407  0.550  0.593
  256D                    0.365  0.590  0.718    0.251  0.489  0.515
  256D lem avg            0.679  0.853  0.872    0.407  0.550  0.593
  256D lem                0.513  0.686  0.795    0.251  0.489  0.519
Note. For each approach, we report the results where a feminine word for an 
occupation is the input (f input) and we search for the equivalent masculine 
term, and where a masculine word for an occupation is the input (m input) and 
we search for the equivalent feminine term. The examples where the embeddings 
do not cover the input occupation were dismissed. The best result in each 
column is in bold. 
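The P@1, P@5 and P@10 columns can be read with the help of the short Python sketch below. It assumes that P@k is the share of covered inputs whose expected counterpart appears among the first k ranked candidates; this definition, the function name precision_at_k and the toy candidate lists are illustrative assumptions and not results from the evaluated embeddings.

```python
def precision_at_k(ranked, gold, k):
    """Share of covered inputs whose expected counterpart is in the top k.

    ranked: input word -> ranked candidate list, or None when the
            embeddings do not cover the input (such inputs are dismissed)
    gold:   input word -> expected counterpart of the other gender
    """
    covered = [w for w in gold if ranked.get(w) is not None]
    if not covered:
        return 0.0
    hits = sum(gold[w] in ranked[w][:k] for w in covered)
    return hits / len(covered)

# Hypothetical f-input example: feminine inputs, masculine targets.
gold = {"učiteljica": "učitelj", "zdravnica": "zdravnik", "kuharica": "kuhar"}
ranked = {
    "učiteljica": ["učitelj", "profesor"],  # correct at rank 1
    "zdravnica": ["kirurg", "zdravnik"],    # correct at rank 2
    "kuharica": None,                       # not covered, dismissed
}

print(precision_at_k(ranked, gold, 1))  # 0.5
print(precision_at_k(ranked, gold, 2))  # 1.0
```

Dismissing uncovered inputs shrinks the denominator, so under this assumption the scores of embeddings with different vocabulary coverage are computed over different numbers of examples.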
Table 9: Results for Croatian embeddings
Croatian word embeddings  f input                m input
  dimensions and approach P@1    P@5    P@10     P@1    P@5    P@10
fastText.cc
  300D avg                0.604  0.883  0.944    0.536  0.603  0.609
  300D                    0.452  0.838  0.914    0.429  0.599  0.606
  300D lem avg            0.731  0.939  0.954    0.546  0.637  0.644
  300D lem                0.660  0.924  0.954    0.508  0.618  0.634
fastText Embeddia
  100D avg                0.896  0.941  0.959    0.625  0.669  0.672
  100D                    0.797  0.928  0.937    0.459  0.634  0.656
  100D lem avg            0.905  0.941  0.968    0.625  0.666  0.672
  100D lem                0.833  0.932  0.941    0.503  0.641  0.662
  300D avg                0.829  0.937  0.973    0.616  0.675  0.675
  300D                    0.703  0.914  0.950    0.431  0.662  0.672
  300D lem avg            0.923  0.982  0.986    0.631  0.675  0.678
  300D lem                0.865  0.950  0.964    0.578  0.672  0.675
fastText CLARIN.SI-embed.hr (word)
  100D avg                0.896  0.933  0.941    0.670  0.749  0.754
  100D                    0.778  0.904  0.919    0.491  0.699  0.740
  100D lem avg            0.907  0.930  0.944    0.673  0.746  0.754
  100D lem                0.815  0.904  0.915    0.550  0.711  0.746
fastText CLARIN.SI-embed.hr (lemma)
  100D avg                0.244  0.678  0.826    0.266  0.521  0.588
  100D                    0.278  0.593  0.693    0.126  0.336  0.406
Note. For each approach, we report the results where a feminine word for an 
occupation is the input (f input) and we search for the equivalent masculine 
term, and where a masculine word for an occupation is the input (m input) and 
we search for the equivalent feminine term. The examples where the embeddings 
do not cover the input occupation were dismissed. The best result in each 
column is in bold.
Table 10: Share of cases where the result of the analogy with the highest cosine similarity is 
the input occupation itself, before filtering is done to produce the results of Tables 2 and 3 
(both male to female and female to male analogies)
Slovene word embeddings   Share of outputs
  dimensions and approach equal to inputs
ELMo Embeddia
  1024D l0 avg            0.547
  1024D l0                0.547
  1024D l0 lem avg        0.547
  1024D l0 lem            0.547
  1024D l1 avg            0.423
  1024D l1                0.483
  1024D l1 lem avg        0.423
  1024D l1 lem            0.483
  1024D l2 avg            0.064
  1024D l2                0.088
  1024D l2 lem avg        0.064
  1024D l2 lem            0.088
fastText.cc
  300D avg                0.831
  300D                    0.825
  300D lem avg            0.831
  300D lem                0.825
fastText Embeddia
  100D avg                0.143
  100D                    0.141
  100D lem avg            0.143
  100D lem                0.141
  300D avg                0.419
  300D                    0.513
  300D lem avg            0.419
  300D lem                0.513
fastText CLARIN.SI-embed.sl (word)
  100D avg                0.316
  100D                    0.310
  100D lem avg            0.316
  100D lem                0.310
fastText Sketch Engine (word)
  100D avg                0.096
  100D                    0.135
  100D lem avg            0.096
  100D lem                0.135
fastText Sketch Engine (lemma)
  100D avg                0.803
  100D                    0.927
word2vec Kontekst.io
  256D avg                0.483
  256D                    0.718
  256D lem avg            0.483
  256D lem                0.718
Croatian word embeddings  Share of outputs
  dimensions and approach equal to inputs
fastText.cc
  300D avg                0.672
  300D                    0.664
  300D lem avg            0.672
  300D lem                0.664
fastText Embeddia
  100D avg                0.094
  100D                    0.094
  100D lem avg            0.094
  100D lem                0.094
  300D avg                0.352
  300D                    0.441
  300D lem avg            0.352
  300D lem                0.441
fastText CLARIN.SI-embed.hr (word)
  100D avg                0.103
  100D                    0.114
  100D lem avg            0.103
  100D lem                0.114
fastText CLARIN.SI-embed.hr (lemma)
  100D avg                0.837
  100D                    0.771
Note. The number of all cases is 468 (from 234 occupation pairs) for Slovene and 750 (from 375 
occupation pairs) for Croatian.
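The quantity in Table 10 can be illustrated with a small Python sketch that keeps the input word among the candidates and counts how often it is its own nearest neighbour after the offset is added. The toy vectors, the offset and the function names top1_unfiltered and share_equal_to_input are hypothetical; the sketch only shows the kind of counting involved, not the computation behind the reported numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top1_unfiltered(emb, word, offset, sign=1):
    """Nearest word to emb[word] + sign * offset; the input is NOT removed."""
    target = [x + sign * o for x, o in zip(emb[word], offset)]
    return max(emb, key=lambda w: cosine(target, emb[w]))

def share_equal_to_input(emb, inputs, offset, sign=1):
    """Share of inputs that are returned as their own top-1 analogy result."""
    same = sum(top1_unfiltered(emb, w, offset, sign) == w for w in inputs)
    return same / len(inputs)

# Hypothetical toy vectors; "rudar" has no feminine counterpart nearby.
emb = {
    "kuhar": [-0.2, 1.0, 0.0], "kuharica": [0.2, 1.0, 0.3],
    "pilot": [-0.2, 0.0, 1.0], "pilotka": [0.2, 0.0, 1.0],
    "rudar": [-0.2, 0.5, 0.9],
}
offset = [0.4, 0.0, 0.0]  # assumed masculine-to-feminine offset

print(top1_unfiltered(emb, "kuhar", offset))  # kuharica
print(top1_unfiltered(emb, "rudar", offset))  # rudar
share = share_equal_to_input(emb, ["kuhar", "pilot", "rudar"], offset)
```

In this toy setting one of the three masculine inputs is returned unchanged, so the share is one third.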