Annotated Lexicon for Sentiment Analysis in the Bosnian Language

Sead JAHIĆ
Faculty of Mathematics, Natural Science and Information Technologies, University of Primorska

Jernej VIČIČ
Faculty of Mathematics, Natural Science and Information Technologies, University of Primorska; Fran Ramovš Institute of the Slovenian Language

The paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented along with a usability study, which concentrates on language coverage. The starting base was composed by translating the Slovenian annotated lexicon and then manually checking the translations and annotations. The language coverage was observed using two reference corpora. The Bosnian language is still considered a low-resource language. A reference corpus comprised of automatically crawled web pages is available for Bosnian, but the authors had a hard time sourcing any corpora with a clear time frame for the texts contained therein. A corpus of contemporary texts was therefore constructed by collecting news articles from several Bosnian web portals. Two language coverage methods were used in this experiment. The first used a frequency list of all words extracted from two reference Bosnian language corpora, and the second ignored the frequencies as the main factor in counting. The coverage computed with the first method was 19.24% for the first corpus and 28.05% for the second. The second method yielded 2.34% coverage for the first corpus and 6.98% for the second. The results of the study present a language coverage that is comparable to the state of the art in the field. The usability of the lexicon was already proven in a Twitter-based comparison.

Keywords: Bosnian lexicon, corpus, sentiment analysis, AnAwords, stopwords, log-likelihood, annotation

Jahić, S. et al.: Annotated Lexicon for Sentiment Analysis in the Bosnian Language. Slovenščina 2.0, 11(2): 59–83.
1.01 Izvirni znanstveni članek / Original Scientific Article
DOI: https://doi.org/10.4312/slo2.0.2023.2.59-83
https://creativecommons.org/licenses/by-sa/4.0/

1 Introduction

Sentiment analysis, also known as opinion mining, is a field of study within the larger discipline of natural language processing (NLP) that aims to determine the sentiment expressed in text, categorizing it as positive, negative, or neutral. The goal of sentiment analysis is to extract meaningful insights from large amounts of unstructured data, such as social media posts (Iglesias and Moreno, 2019) or online product reviews (Wu et al., 2020), in order to understand public opinions and attitudes. In this paper we present the coverage of the first Bosnian sentiment-annotated lexicon using two reference corpora.

Although Bosnian is arguably closely related to Serbian and Croatian, there are subtle differences between these three languages that are more evident from the sentiment analysis point of view. The main differences between Bosnian, Serbian, and Croatian lie in the use of vocabulary, grammar, and syntax. Although the three languages share a similar Slavic origin and linguistic heritage, they have evolved differently over time and been influenced by different cultural, historical, and political factors. These differences are particularly pronounced when it comes to sentiment analysis, as the choice of words and the way they are used can significantly impact the sentiment expressed in a text. As such, it is important to consider these differences when developing sentiment analysis tools for the Bosnian language.
The lexicon used in this study has been constructed using two reference corpora and combines NLP and machine learning techniques to assign weighted sentiment scores to the entities within a sentence or phrase. The study covers two approaches to evaluating the performance of the lexicon: the first takes into account the frequencies of the covered and missed words, while the second simply counts the words that are covered by the lexicon.

The paper provides a comprehensive overview of the state of the art in NLP and sentiment analysis for the Bosnian language. It explains the methodology used in the process of creating the lexicon, cleaning the corpora, the coverage of the corpora by the lexicon, and annotation. The results of the experiment and the conclusion, along with suggestions for future work, are presented in the last section of the paper.

In summary, the development of the Bosnian sentiment-annotated lexicon is a step towards better understanding and analysing public opinion expressed in the Bosnian language. The results of the study suggest that the lexicon has good coverage, and the methodology used in its construction can serve as a reference for future work in this field.

2 State of the art

There has been quite extensive research in the area of sentiment analysis, and many types of models and algorithms have been proposed depending on the final goal of the analysis and the interpretation of user feedback and queries, such as fine-grained sentiment analysis (based on polarity precision) (Chen et al., 2020), emotion detection, aspect-based sentiment analysis (Suciati and Budi, 2020), and multilingual sentiment analysis (Kia et al., 2016).
All these algorithms and models can be divided into one of three basic classes: rule-based systems (relying on long-used linguistic methods, rules, and annotated linguistic materials such as annotated lexicons), automatic (corpus-based) systems, and hybrid systems that combine properties of both previous types. Hybrid systems use machine learning techniques together with NLP techniques developed in computational linguistics, such as stemming, tokenization, part-of-speech tagging, parsing, and lexicons.

Lexicons have been widely used for sentiment analysis. One of the first known human-annotated lexicons for sentiment analysis is the General Inquirer lexicon (Hartman et al., 1967), which contains 11,788 English words (2,291 labeled as negative and 1,915 as positive, with the rest labeled as objective).

Sentiment lexicons exist for most Slavic languages, including Bulgarian (Kapukaranov and Nakov, 2015), Croatian (Glavaš et al., 2012), Czech (Veselovská, 2013), Macedonian (Jovanoski et al., 2015), Polish (Wawer, 2012), Slovak (Okruhlica, 2013), Slovenian (Kadunc, 2016) and Bosnian (Jahić and Vičič, 2023b), with the last of these containing 1,219 entries labeled as positive and 3,935 as negative.

Important questions for natural language researchers, general linguists, and even teachers and students concern how much text coverage can be achieved with a certain number of words from the lexicon in a given language, since the number of terms in the lexicon is a few orders of magnitude smaller than the number of terms in the corpus.

Studies of vocabulary coverage have been carried out for many languages, such as German (Jones, 2006), where a study based on the BYU/Leipzig Corpus of Contemporary German has shown that a basic vocabulary of 3,000 high-frequency words can account for between 75% and 90% of the words in a text.
Moreover, for Spanish (Davies, 2005) it is claimed that knowing 4,000 words is enough to cover or recognize more than 90% of the words in native texts. Moreno-Ortiz and Pérez-Hernández (2018) presented Lingmotif-lex, a wide-coverage, domain-neutral lexicon for sentiment analysis in English, and stated that it achieves significantly better performance than the other lexicons for English, with coverage of up to 75% and 84% (F1-score) for two datasets.

In a study aimed at developing resources for sentiment analysis in Slovene, Bučar, Žnidaršič and Povh (2018) collected more than 250,000 news items from five Slovenian online media sources as the basis for their resources, which include corpora, annotations, and a lexicon. To evaluate the quality of the annotation process, they used five different measures of correlation. The results showed good internal consistency across all levels of granularity, although the values decreased slightly when applied to smaller units of text.

Corpus-based and lexicon-based methods have been increasingly used to compare language coverage, and the comparison of hundreds of thousands or even millions of words/lemmas from a corpus with a few thousand words/lemmas from a lexicon is one of the main types of corpus comparison.

3 Construction of the lexicon

The Bosnian sentiment-annotated lexicon is presented and analysed in this paper. For this purpose, our data consists of the "core" lexicon (Jahić and Vičič, 2023b), a list of stopwords, and a list of AnAwords (Affirmative and Non-affirmative words, such as "ekstremno" ("vrlo") – extremely, "jedva" – barely) (Jahić and Vičič, 2023a), as clarified in Figure 1.

Figure 1: Construction of the lexicon.

The lexicon creation process consisted of taking entries (word forms) from the Slovene sentiment lexicon KSS 1.1 (Kadunc, 2016) and translating them into Bosnian.
We also allow some variants of the same lemma as part of the lexicon. The Bosnian translation was created in a dual-phase approach. In the first phase, the Slovenian lexicon was transformed into Bosnian through well-defined steps: it was first translated into English using the translators from Google and Microsoft, and this intermediary English version was then translated into the Bosnian language, as visually depicted in Figure 2. More specifically, the first phase involved these steps:
• Translation of the Slovene sentiment lexicon KSS 1.1 using Microsoft Translator.
• Translation of the same lexicon using Google Translator.
• Manual comparison and merging of the two lists, removing duplicate entries.
• Manual cross-checking to ensure that words had matching or similar meanings.
• The result was the creation of the Bosnian_MG_Translated lexicon.

Figure 2: The lexicon creation process.

The second phase created the lexicon in a two-fold manner. First, word forms from the Slovenian lexicon were translated into Bosnian manually. This comprehensive process encompassed a verification of each term using various tools, including Pons,1 Google Translate, ImTranslator,2 and the Dictionary of Slovenian Literary Language (SSKJ – Slovar slovenskega knjižnega jezika3). This process yielded the "Bosnian_Manually_Translated" lexicon.

These two lexicons (the "Bosnian_MG_Translated" lexicon and the "Bosnian_Manually_Translated" lexicon) were subsequently united and merged into a cohesive entity, referred to as the "Bosnian_Merged" lexicon.
The refinement process further entailed the removal of duplicate entries, resulting in the initial iteration of the Bosnian sentiment lexicon.

1 https://sl.pons.com/
2 https://imtranslator.net/
3 https://fran.si/

To ensure the accuracy and robustness of the lexicon, a back-translation procedure was executed. This involved translating the newly constructed Bosnian lexicon back into the Slovenian language, as depicted in Figure 2.

The goal of the back-translation procedure was to retranslate the obtained Bosnian lexicon into a Slovenian lexicon and then compare this translated lexicon (in Slovenian) with the initial KSS 1.1 lexicon. What we found during the back-translation process is that many words were translated into a form that is not present in KSS 1.1, while the infinitive form of those words is indeed available in KSS 1.1.

To circumvent this challenge in the evaluation phase, and also when using the lexicon in the sentiment analysis process, we used the get_close_matches function (part of the difflib module in Python). With this function we effectively pinpointed the closest approximations to the target string from a pool of candidate strings. This substantially improved the coverage and reliability of our lexicon, amplifying the precision of our sentiment analysis efforts.

The method works by comparing the target string with each candidate string, using a defined similarity ratio, and then returning the matches with the highest similarity ratio. The number of matches returned and the similarity ratio threshold can be controlled through the n and cutoff parameters, respectively. The close-matched strings are ordered by similarity score, so the most similar string comes first in the list. The function accepts four parameters:
• word: The string for which we need the close matches.
• possibilities: Usually a list of string values against which the word is matched.
• n: An optional parameter with a default value of 3. It specifies the maximum number of close matches required.
• cutoff: Also an optional parameter, with a default value of 0.6. It specifies that the close matches should have a score greater than the cutoff.

In our case, we pick the first element from the close-matched strings list (the one with the highest similarity score). Several cutoff values were considered, and the best confidence-accuracy score was reached with a cutoff of 81%.

Table 1: Comparing the Slovenian lexicon before and after the translation process

Slovenian lexicon (lemmas)   Cutoff   Translated   Matched   Comparing
(Kadunc, 2016)                        words        words     accuracy
Positive: 1,911              80%      1,758        1,829     -
                             81%      1,781        1,686     88.23%
                             82.5%    1,790        1,627     85.14%
                             85%      1,806        1,550     81.11%
                             90%      1,838        1,369     71.64%
                             100%     1,858        1,235     64.63%
Negative: 5,125              80%      4,572        4,999     -
                             81%      4,654        4,604     89.83%
                             82.5%    4,690        4,432     86.48%
                             85%      4,739        4,125     80.49%
                             90%      4,846        3,514     68.57%
                             100%     4,898        3,067     59.84%

The accuracy score was computed by comparing the primary lexicon of the Slovenian language (Kadunc, 2016) and the back-translated lexicon of the Slovenian language. The equation used is as follows:

accuracy = (matched words / lemmas in the original Slovenian lexicon) × 100%

The Bosnian sentiment lexicon consists of 3,935 negative words (Lexicon negative) and 1,219 positive words (Lexicon positive). Besides that, we also added a list of 394 Bosnian stopwords (such as "gosp." ("Mr."), "je" ("is"), "juli" ("July"), and so on) and a list of AnAwords. Stopwords usually refer to the most common words in a language, and there is no single universal list of these. The first Bosnian sentiment lexicon was tested by using it to label tweets written in the Bosnian language (Jahić and Vičič, 2023a, 2023c).
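The behaviour of get_close_matches described above can be illustrated with a minimal, self-contained sketch; the candidate word forms below are invented for illustration and are not entries from KSS 1.1:

```python
# Minimal illustration of difflib.get_close_matches as used in the
# matching step; the Slovenian-like word forms are toy examples.
from difflib import get_close_matches

# Hypothetical pool of candidate lemmas (not taken from KSS 1.1).
possibilities = ["lep", "lepo", "lepota", "grd"]

# n=1 keeps only the single best match; cutoff=0.81 mirrors the 81%
# threshold that gave the best accuracy in Table 1.
match = get_close_matches("lepi", possibilities, n=1, cutoff=0.81)
print(match)  # ['lep'] – the only candidate whose ratio exceeds 0.81
```

Lowering the cutoff admits looser matches, which is exactly the trade-off explored across the rows of Table 1.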
4 Methodology and work

The core emphasis of this paper is on assessing the coverage achieved by the lexicon, rather than on its creation, although a comprehensive account of the latter is also presented. More specifically, the focus is on evaluating how many lemmas the lexicon covers in bsWaC and bsNews, as detailed below. The language coverage of the lexicon was evaluated on two different corpora:
• The Bosnian web corpus bsWaC 1.1 (Ljubešić and Klubička, 2014). The bsWaC 1.1 corpus was part of a collection of corpora named {bs, hr, sr}WaC – the web corpora of the Bosnian, Croatian, and Serbian languages. The number of seed URLs (crawled web pages) was 8,388 for bsWaC, 11,427 for srWaC, and 14,396 for hrWaC. The bsWaC corpus consists of more than 285 million tokens (286,865,790, to be precise) written in Bosnian. The corpus is also morphosyntactically annotated and lemmatized. At the time of writing, this corpus was the de facto reference corpus for the Bosnian language.
• The Bosnian news corpus 2021, bsNews 1.0 (Vičič, 2021), which is a collection of web news articles crawled at the start of 2021. The corpus contains a balanced set of at most 2,000 of the most recent news articles from each identified web news portal in Bosnia and Herzegovina. The list of portals is maintained by the Press Council in Bosnia and Herzegovina.4 The corpus contains news articles from 46 portals and was used as a contemporary and balanced source. The sentence tokens are morpho-syntactically annotated with MULTEXT-East morpho-syntactic annotations for Croatian, Version 6.5 The corpus was morpho-syntactically annotated and lemmatized with ToTaLe (Erjavec et al., 2015). It consists of more than 36 million tokens in the Bosnian language.

4 Vijeće za štampu u Bosni i Hercegovini: https://www.vzs.ba/index.php/vijece-za-stampu/internet-portali-u-bosni-i-hercegovini.
5 http://nl.ijs.si/ME/V6/

Two different approaches are applied:
• First, all lemmas with their frequencies were considered;
• Second, the frequencies of the lemmas were ignored.

A list of lemmas with frequencies was extracted from each corpus and cut off at five occurrences to avoid clutter.

The list of lemmas extracted from the first corpus (Ljubešić and Klubička, 2014) consisted of 348,988 different lemmas with frequencies. The lemmas are ordered by increasing frequency, where the lowest value is five (the cutoff) ("batkovi" – drumsticks, …) and the highest value is 16,652,066 for the lemma "biti" – to be.

The list of lemmas extracted from the second corpus (the bsNews 1.0 corpus (Vičič, 2021)) consisted of 101,771 lemmas ordered by decreasing frequency, the most frequent lemma again being "biti" – to be, with a frequency of 2,350,487, and with the lowest frequency of five for lemmas such as "polegnuti" – lay down.

Not all lemmas can be included in the analysis. Symbols, equation marks, and numbers, even if part of the corpus, cannot be part of the lexicon, especially a sentiment-annotated lexicon. The following items were thus removed from both corpora in the cleaning process: emoticons, punctuation (quotes, exclamation marks, etc.), numbers, and hyperlinks.

4.1 The first approach: lemmas with their frequency were included in the analysis (all appearances of lemmas were used for each corpus)

In the first approach used in our analysis, we considered lemmas along with their frequency as the basis for our investigation. This means that we included all instances of lemmas found in each corpus. Figure 3 shows the procedure for checking the existence of given words from the lexicon in the corpus.

Figure 3: Process of matching words from the corpus with words from the lexicon.
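The cleaning step described above can be sketched as a simple token filter; the regular expressions below are our own illustrative approximations of the listed categories (hyperlinks, numbers, punctuation, and simple emoticons), not the authors' exact filters:

```python
import re

def clean_tokens(tokens):
    """Drop tokens that cannot carry sentiment: hyperlinks, numbers,
    and punctuation-only tokens (the last pattern also catches simple
    ASCII emoticons such as ':)'). A rough sketch of the cleaning step."""
    url = re.compile(r"^(https?://|www\.)")
    number = re.compile(r"^\d+([.,]\d+)*$")
    punct_only = re.compile(r"^\W+$")
    return [
        tok for tok in tokens
        if not (url.match(tok) or number.match(tok) or punct_only.match(tok))
    ]

tokens = ["biti", "!!!", ":)", "42", "http://example.ba", "dobro"]
print(clean_tokens(tokens))  # ['biti', 'dobro']
```

Real corpus cleaning would additionally handle Unicode emoji and tokenizer-specific artifacts, but the principle of filtering out non-sentiment-bearing tokens is the same.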
If a word from the lexicon exists in the corpus, its corpus frequency is added to freq(lexicon); otherwise, the value 0 is added to freq(lexicon). The sum of all word frequencies in the corpus is given as freq(corpus), and freq(stopwords) represents the sum of the frequencies of all stopwords that appear in the corpus. The coverage is computed as:

coverage = freq(lexicon) / (freq(corpus) - freq(stopwords)) × 100%    (1)

where all stopwords are excluded from the corpus.

4.2 The second approach: using the accuracy of words without the influence of frequency

Given that the sentiment value is not at the forefront at this stage of the research (we are looking for language coverage), the lists of 1,219 positive and 3,935 negative words were united in a single lexicon.

In addition to the lexicon, two other groups of words – stopwords (394 of them) and AnAwords (Affirmative and Non-affirmative words) – play a significant role in this process. Jahić and Vičič (2023a) pointed out that stopwords usually refer to the most common words in a language and that there is no single universal list of these. The AnAwords list of 139 words was created by Jahić and Vičič (2023a), and it has been shown (Osmankadić, 2003) that most of these are intensifiers.

The consideration of words from the AnAwords list significantly impacted the corpus coverage, as elucidated in the second stage of the second approach. Since AnAwords, like stopwords, carry no inherent sentiment value, they were excised from the corpora, in line with the objective of eliminating non-sentiment-bearing terms.
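Returning briefly to the first approach, the frequency-based coverage of equation (1) in Section 4.1 can be sketched with the totals later reported in Table 3; the function and variable names here are ours:

```python
def coverage_pct(freq_lexicon, freq_corpus, freq_stopwords):
    # Equation (1): the lexicon's frequency mass divided by the corpus
    # frequency mass after stopword frequencies are excluded.
    return 100.0 * freq_lexicon / (freq_corpus - freq_stopwords)

# Totals reported in Table 3 for the two corpora.
print(round(coverage_pct(28_174_959, 187_957_442, 41_542_468), 2))  # 19.24 (bsWaC)
print(round(coverage_pct(6_371_417, 30_168_771, 7_456_808), 2))     # 28.05 (bsNews)
```

Both values reproduce the coverage figures reported in the Results section, which confirms that the formula is read correctly.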
The process of annotating the lexicon went through several stages, all based on the following equation:

coverage = |FOUND| / |NOT_FOUND| × 100%    (2)

where FOUND is the list of all words in the corpus that were matched with words from the lexicon, and NOT_FOUND is the list of those that were not. In more detail, these stages are as follows:
• In the first stage, simple coverage of the corpus by the lexicon was measured; the stopwords were still part of the corpus at this stage.
• In the second stage, the corpus coverage was computed without the stopwords and with the AnAwords also excluded. The rationale behind these decisions stems from the fact that the number of stopwords and AnAwords is almost negligible in comparison to the total number of elements in the corpus; as such, a substantial variance in coverage during this stage was not anticipated.
• Guided by the results of research conducted for the corpus-based lexical analysis of subject-specific university textbooks in English (Hajiyeva, 2015), in the third stage coverage was observed through the frequency distribution of words.
• In the fourth stage, the question arises as to whether it is possible to group similar words (such as "anđeo" and "anđel" (angel)) and view them as a single word. As Davies (2005) stated, one solution to this problem is grouping words into word families. Given this possibility of grouping, matching functions were applied between corpus words and lexicon words.
• In the fifth stage, the log-likelihood was computed for each word in the lexicon. Following Rayson and Garside (2000), the word frequency list is then sorted by the resulting log-likelihood values.
This places the largest log-likelihood value at the top of the list, representing the word with the most significant relative frequency difference between the two corpora. This method enables a comparison of the most indicative (or characteristic) words in the two corpora.

5 Results

This section showcases the results of the two approaches described earlier. We started by cleaning the corpora, which led to the inclusion of 263,969 words from the first corpus and 84,859 words from the second in our subsequent analysis (see Table 2).

Table 2: Number of lemmas left after pre-processing the corpora

                                CORPUS1    CORPUS2
The overall number of lemmas    348,988    101,771
Cleared lemmas                  263,969     84,859
Percent (%)                       75.64      83.38

In the first approach (where the influence of frequency was considered), freq(corpus), the sum of stopword frequencies (freq(stopwords)), and the overall sum of all frequencies of the words from the lexicon (freq(lexicon)) were computed. Using equation (1), the coverage of corpus1 is 19.24% and the coverage of corpus2 is 28.05% (see Table 3).

Table 3: Coverage of the corpora's lemmas with words from the sentiment lexicon

         Freq(corpus)   Freq(lexicon)   Freq(stopwords)   Coverage
CORPUS1  187,957,442    28,174,959      41,542,468        19.24%
CORPUS2   30,168,771     6,371,417       7,456,808        28.05%

The second approach (where the influence of frequency is ignored) computes the overall coverage of the corpora without using word frequencies. The motivation behind this approach was to count how many different lemmas from the corpus are already present in the sentiment lexicon. There are several stages in this approach.
• First stage: In this first stage, 1.523% coverage of the first corpus and 4.098% coverage of the second corpus was achieved.
Table 4: Coverage of the corpora's lemmas with words from the sentiment lexicon (without any additional changes being made)

              CORPUS1    CORPUS2
FOUND           3,959      3,341
NOT_FOUND     260,010     81,518
Coverage (%)    1.523      4.098

Table 4 presents the lemmas that were matched with words from the lexicon (FOUND) and those that were absent from the lexicon (NOT_FOUND). Maximum coverage of the corpora is possible only if all words from the lexicon are included in the corpora; this means that the maximum coverage is 1.99% for the first corpus and 6.47% for the second. In contrast, the coverage of the lexicon by the corpora is 76.81% and 64.82%: of the 5,154 words from the lexicon, 3,959 were present in corpus1, indicating 76.81% use of the lexicon, and 3,341 were present in corpus2, indicating 64.82% use.
• The second stage increases the number of words in FOUND, since all words that are stopwords or AnAwords have been detected in the corpus. In this case, the coverage of the corpora increases to 1.7% and 4.62% for corpus1 and corpus2, respectively.

Table 5: Coverage of the corpora's lemmas with words from the sentiment lexicon

              CORPUS1    CORPUS2
FOUND           4,406      3,747
NOT_FOUND     259,533     81,112
Coverage (%)      1.7       4.62

• The third stage distributes words by frequency and counts the number of lemmas that were or were not covered by words from the lexicon. Of the total of 5,154 words in the lexicon, 3,257 (63.19%) were included among the 50,000 most frequent lemmas from corpus1 (see Figure 4 (left)). Meanwhile, 3,071 of the 15,000 most frequent lemmas from corpus2 were in the lexicon. Since the overall number of words in the lexicon is 5,154, this means that 59.58% of all words from the lexicon are found among the 15,000 most frequent lemmas from corpus2 (see Figure 4 (right)).

Figure 4: Annotated lexicon by distributed lemmas from corpus1 (left) and corpus2 (right).
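The third-stage count, i.e. how many lexicon entries appear among the N most frequent corpus lemmas, can be sketched as follows (toy data; the helper function is ours, not the authors' code):

```python
def lexicon_hits_in_top_n(freq_pairs, lexicon, n):
    """Count how many of the n most frequent lemmas are lexicon entries.
    freq_pairs is a list of (lemma, frequency) tuples."""
    top = sorted(freq_pairs, key=lambda p: p[1], reverse=True)[:n]
    return sum(1 for lemma, _ in top if lemma in lexicon)

# Toy frequency list and a tiny lexicon (illustrative values only).
freqs = [("biti", 100), ("i", 90), ("dobar", 40), ("loš", 35), ("grad", 20)]
lexicon = {"dobar", "loš"}
print(lexicon_hits_in_top_n(freqs, lexicon, 3))  # 1 – only "dobar" is in the top 3
```

Applied to the real frequency lists, this is the computation behind the 3,257-of-50,000 and 3,071-of-15,000 counts reported above.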
• Fourth stage: In the fourth stage, the lexicon annotation increased to more than 2.2% for the first corpus. Even though this looks like it contradicts the claim that the maximum coverage for the first corpus is about 1.99%, it does not. The reason is that the get_close_matches function was applied with several cutoffs and n=1 (one possibility). The function works in such a way that all nearly similar words (82.5% and 85% matching, in this case, for the first and second corpora, respectively) are considered as one word. For example, anđel (Engl. angel), anđelko (Engl. little angel), and anđela (as in "I saw an angel") were all replaced with the word anđeo. We found that for a cutoff lower than 82.5% the matching function returns words that are not matched with or related to the root word. The impact of get_close_matches is presented in Figure 5 on a small part of the corpus with the matching word "anđeo" (angel).

Figure 5: Implementation of get_close_matches on corpus1 for the word "anđeo" (angel).

In the figure shown above there are some words whose matching factor with the word "anđeo" was greater than 0.85. Those words (anđeo, anđelko, anđeoski) were replaced with the word "anđeo". With this in mind, the number of words in corpus1 decreased (see Table 6). If an 85% cutoff is applied, the overall number of words in corpus1 is reduced to 242,615, and the annotation increases to 2.22%. This means that out of the 242,615 words in corpus1, 5,280 were found in the lexicon, stopwords, and AnAwords groups. The same applies to corpus2, where an annotation of 4,706 words from the lexicon was detected. Moreover, when a cutoff of 82.5% is applied, we get an annotation of 2.34% and 6.98% for the first and second corpora, respectively.
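The fourth-stage grouping of similar word forms can be sketched with get_close_matches; the 0.8 cutoff and the word list below are illustrative choices of ours (the paper reports results for cutoffs of 82.5% and 85%):

```python
from difflib import get_close_matches

def group_to_canonical(words, canonical, cutoff=0.8):
    """Replace every word whose similarity ratio to `canonical` meets
    the cutoff with the canonical form, so near-identical forms are
    counted as one word (a sketch of the fourth-stage grouping)."""
    similar = set(get_close_matches(canonical, words, n=len(words), cutoff=cutoff))
    return [canonical if w in similar else w for w in words]

print(group_to_canonical(["anđeo", "anđelko", "grad"], "anđeo"))
# ['anđeo', 'anđeo', 'grad'] – "anđelko" is folded into "anđeo"
```

Merging forms this way shrinks the list of distinct corpus lemmas, which is why the measured coverage can exceed the maximum computed over the ungrouped lemma list.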
Table 6: Coverage of the corpora's lemmas with words from the sentiment lexicon

Cutoff:          82.5%                  85%
                 CORPUS1    CORPUS2    CORPUS1    CORPUS2
No. of lemmas    233,157     74,134    242,615     77,437
FOUND              5,327      4,838      5,280      4,706
NOT_FOUND        227,830     69,296    224,939     72,731
Coverage (%)        2.34       6.98       2.22       6.47

• Fifth stage: In the fifth stage, the log-likelihood was computed for each of the 5,000 most frequent lemmas from both corpora (10,000 overall), and only those common to both corpora were counted (4,207 in total). This provided the opportunity to compare the frequencies of word-form occurrences in two texts (here, the two corpora) and obtain a statistical measure of the significance of the differences.

To compute the log-likelihood, a two-by-two contingency table (see Table 7) of frequencies was constructed for each word.

Table 7: Contingency table for word frequencies

                      CORPUS1   CORPUS2   TOTAL
Freq. of word         a         b         a+b
Freq. of other words  c-a       d-b       c+d-a-b
TOTAL                 c         d         c+d

In Table 7, c and d represent the number of words in the corpora; in this case, c = 183,481,818 and d = 28,690,802, obtained by summing the frequencies of all 4,207 words. Following Rayson and Garside (2000), the equations

E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d)

were used to calculate the expected values, and

LL = 2(a·ln(a/E1) + b·ln(b/E2))

to calculate the log-likelihood for each word.

Using these equations, a word frequency list (LL_list) was created, and the words were sorted from the smallest to the largest values, where the largest value represents the word that has the most significant relative frequency difference between the two corpora. As such, the most characteristic words of one corpus, as compared to the other, were listed at the bottom of the list, while words with almost the same relative frequency were listed at the top. To evaluate the result and identify the N words that have similar interpreted values, we needed another method.
As stated by Kilgarriff in Comparing Corpora (Kilgarriff, 2001), the simplest method that could be used for this is applying Sketch Engine. For each word the frequency quotient was computed; a quotient of 1 indicates that the word's frequency is identical in both corpora. The higher the score in the Sketch Engine frequency word list (SE_list), the greater the difference between the corpora. However, it should be noted that this score can only be used for comparing differences, and it gives no clues as to what exactly is different between the corpora.

Because of this, words with similar interpreted values were identified. This means that the percentage of coverage of the N highest keyness-score words by words from the lexicon was computed for both the LL_list and the SE_list. The 500 most relevant words in each list were identified. These words distinguish one corpus from the other, and also present the strengths of one corpus over the other. First the log-likelihood was computed and the LL_list was created, giving a list of the 500 words with the biggest frequency differences between the two corpora. Then the two corpora were compared using Sketch Engine and the SE_list was created; as for the LL_list, the 500 words with the biggest frequency differences between the two corpora were identified.

The aim was not to compare the corpora but to check the coverage of the most relevant words – those that distinguish the two corpora from each other – by words from the lexicon. Using these comparison methods produced a matching factor of 55.2% between the LL_list and SE_list of the 500 most relevant words.
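The fifth-stage statistic follows directly from the contingency table (Table 7); below is a sketch with invented frequencies, so the corpus sizes are illustrative rather than the paper's totals:

```python
import math

def log_likelihood(a, b, c, d):
    """Rayson & Garside (2000) log-likelihood for a word occurring
    a times in corpus1 (total size c) and b times in corpus2 (size d)."""
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus2
    ll = 0.0
    if a:                        # guard against log(0)
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word that is relatively over-represented in the first corpus
# produces a large LL value, i.e. a marked frequency difference.
print(round(log_likelihood(200, 30, 1_000_000, 500_000), 2))
```

Sorting all shared lemmas by this value yields the LL_list described above, with the most corpus-distinctive words at one end.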
Table 8: Coverage of the 500 most relevant words by the lexicon group, and distribution of words from the lexicon, stopwords, and AnAwords lists (LSAnA group)

                        LL_list    SE_list
FOUND words             156        136
Coverage (%)            31.2       27.2
Lexicon (pos / neg)     30 / 50    44 / 42
Stopwords               62         35
AnAwords                14         15

As can be seen in Table 8, of the 500 words, 156 were matched by the LSAnA group for the LL_list (31.2% coverage) and 136 for the SE_list (27.2% coverage). For the LL_list, words from the lexicon accounted for 51.28% of the matched words ((30+50)/156), and the lexicon alone covered 16% of the 500 words (80/500). For the SE_list, the lexicon alone covered 17.2% of the 500 words (86/500), and the share of lexicon words among all the words covered by the SE_list was about 63.24% ((44+42)/136).

Even though the third and fifth stages offer insights into the annotation of the most frequent words, for the overall annotation the most important stages were the first, second, and fourth, since they produce the overall coverage of the corpora by the lexicon (see Table 9).

Table 9: Annotation of corpora

Approach                                              CORPUS1 (bsWaC)   CORPUS2 (bsNews)
First: using the accuracy of words with the
influence of frequency                                19.24%            28.05%
Second: using the accuracy of words without the
influence of frequency
  First stage                                         1.523%            4.098%
  Second stage                                        1.7%              4.62%
  Fourth stage                                        2.22%–2.34%       6.47%–6.98%

6 Conclusion

Although Bosnian is arguably closely related to Serbian and Croatian, there are subtle differences between these three languages that are more evident from the sentiment analysis point of view. This paper presents the annotation of the first Bosnian sentiment lexicon, whose sentiment basis has been proven in earlier work.
The lexicon includes about 5,500 words (1,219 positive, 3,935 negative, 394 stopwords, and 139 AnAwords) and covers more than 19% of the words in the first observed corpus (corpus1) (Ljubešić & Klubička, 2014) and more than 28% of the words in the second corpus, the BsNews 1.0 corpus (corpus2) (Vičič, 2021). If the emphasis is on the coverage of distinct words from the corpus by the lexicon, then the coverage is 1.7% for corpus1 and 4.62% for corpus2. This coverage increases when matching functions are applied between the corpora's words and the lexicon's words (as described in the fourth stage of the second approach in Section 4); in that case, the coverage rises to 2.34% for corpus1 and 6.98% for corpus2. From 85.07% to 93.67% of the words from the lexicon were found in corpus1 (between 360 and 849 words from the lexicon were not found), and between 82.75% and 92.84% in corpus2, which means that between 407 and 981 words from the lexicon were not found in corpus2. The results show that about a quarter of the words from the corpora have their sentiment value annotated in the lexicon, which greatly helps in the sentiment annotation of sentences (tweets or regular text).

Stopwords and AnAwords were also included in the analysis, which means that the LSAnA group can serve as a representative group for sentiment words, stopwords, and intensifiers (all written in Bosnian).

The language coverage of the lexicon is comparable with the current state of the art, and the values can be compared (Moreno-Ortiz & Pérez-Hernández, 2018).

During the process of creating our lexicon, we were aware that there would be deviations during the translation.
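The two coverage notions reported above (with and without the influence of frequency) can be sketched as follows. This is a simplified reconstruction with hypothetical helper names and toy data, assuming plain tokenized input; it is not the paper's actual pipeline and omits the matching functions of the fourth stage.

```python
from collections import Counter

def token_coverage(tokens, lexicon):
    """Coverage with the influence of frequency: the share of all
    running tokens in the corpus that are found in the lexicon."""
    counts = Counter(tokens)
    covered = sum(f for word, f in counts.items() if word in lexicon)
    return 100 * covered / sum(counts.values())

def type_coverage(tokens, lexicon):
    """Coverage without the influence of frequency: the share of
    distinct corpus words (types) that are found in the lexicon."""
    types = set(tokens)
    return 100 * len(types & lexicon) / len(types)

lexicon = {"dobar", "loš"}
tokens = ["dobar", "dobar", "dobar", "loš", "grad", "rijeka"]

token_cov = token_coverage(tokens, lexicon)  # 4 of 6 running tokens
type_cov = type_coverage(tokens, lexicon)    # 2 of 4 distinct words
```

A frequent lexicon word raises the first figure far more than the second, which is why the frequency-weighted coverage (19.24% and 28.05%) is much higher than the type-based coverage (2.34% and 6.98%).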
The Slovene sentiment lexicon KSS 1.1 also includes multi-part words, which are words composed of multiple individual words joined by "_", such as "dobro_sprejet", "dobro_upravljan", "dobro_voden", "dobro_vzgojen", "energetsko_varčen", "funkcijsko_bogat", and similar terms. Most of these types of words do not have an equivalent in the Bosnian language. However, during the manual review of our lexicon we noticed that some of these words could be included, such as "prekomerna_teža" (Bosnian: predebelo, English: too fat) or "srce_parajoč" (Bosnian: srceparajuće, English: heartbreaking). Despite this, these words did not make it into the primary version of our lexicon.

Table 10: Comparison of lexicon terms

              Positive                        Negative
  Language    core terms    terms with "_"   core terms    terms with "_"
  Slovenian   1,911         61               5,152         276
  Bosnian     1,126         -                3,868         -

As Table 10 shows, the Bosnian lexicon can be updated by finding appropriate translations for the multi-part words from KSS 1.1. This update would have an immediate impact on the lexicon's annotation, as incorporating more terms would allow for more comprehensive annotation. Furthermore, during our analysis we discovered that some of these multi-part words contained elements from the AnAwords list, which we treated separately. Examples of such cases include "hudo_bolan" (Bosnian: veoma bolan, English: very painful), "zelo_poceni" (Bosnian: veoma jeftin, English: very cheap), "povsem_prava" (Bosnian: potpuno pravo (tačno), English: completely right), and others.

Additionally, we found that there were entire expressions in the Slovenian lexicon that were not included in the Bosnian lexicon.
Some examples of these are "nič_hudega_sluteč" (Bosnian: ne slutiti ništa loše, English: unaware of any harm), "obesiti_na_klina" (Bosnian: objesiti o klin, English: hang on a nail), "veliko_hrupa_za_nič" (Bosnian: mnogo buke oko ničega, English: much ado about nothing), and "zvit_kot_lisica" (Bosnian: lukav kao lisica, English: sly as a fox), among others.

The focus in future work will be on developing and improving the LSAnA group. All members of the group should be extended, which means that we expect to have more items/words labelled as positive or negative in our "core" lexicon, as well as extended lists of stopwords and AnAwords. To increase coverage, we will try to create a lexicon with all possible word forms, accounting for the grammatical rules of the Bosnian language (declension, conjugation, change of words by gender, number, and so on). Although the process of annotation, as well as the improvement of the first Bosnian lexicon (Jahić and Vičič, 2023b), is still in development, the results shown here are comparable with those reported for other related languages, and also for language families, as shown in Davies (2005) and Bučar, Žnidaršič and Povh (2018).

Acknowledgments

The authors gratefully acknowledge the European Commission for funding the InnoRenew CoE project (Grant Agreement #739574) under the Horizon 2020 Widespread-Teaming programme and the Republic of Slovenia (investment funding of the Republic of Slovenia and the European Union from the European Regional Development Fund).

References

Bučar, J., Žnidaršič, M., & Povh, J. (2018). Annotated news corpora and a lexicon for sentiment analysis in Slovene. Language Resources and Evaluation, 52, 895–919. doi:10.1007/s10579-018-9413-3

Chen, C., Hu, X., Zhang, H., & Shou, Z. (2020). Fine grained sentiment analysis based on Bert. Journal of Physics: Conference Series, 1651.

Davies, M. (2005).
Vocabulary range and text coverage: Insights from the forthcoming Routledge Frequency Dictionary of Spanish. Selected Proceedings of the 7th Hispanic Linguistics Symposium (pp. 106–115).

Erjavec, T., Ignat, C., Pouliquen, B., & Steinberger, R. (2015). Massive multilingual corpus compilation: Acquis Communautaire and totale. Archives of Control Sciences, 15.

Glavaš, G., Šnajder, J., & Bašić, B. D. (2012). Semi-supervised acquisition of Croatian sentiment. Proceedings of the International Conference on Text, Speech and Dialogue, 7499 (pp. 166–173). Brno, Czech Republic. doi:10.1007/978-3-642-32790-2_20

Hajiyeva, K. (2015). A corpus-based lexical analysis of subject-specific university textbooks for English majors. Ampersand, 2, 136–144. doi:10.1016/j.amper.2015.10.001

Hartman, J. J., Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1967). The General Inquirer: A Computer Approach to Content Analysis. American Sociological Review, 4. doi:10.2307/1161774

Iglesias, C., & Moreno, A. (2019). Sentiment Analysis for Social Media. Sentiment Analysis for Social Media, 1–4. Retrieved from https://www.mdpi.com/journal/applsci/special

Jahić, S., & Vičič, J. (2021). Determining sentiment of tweets using first Bosnian lexicon and (AnA)-affirmative and non-affirmative words. Advanced Technologies, Systems, and Applications V, 142, 361–373. doi:10.1007/978-3-030-54765-3_25

Jahić, S., & Vičič, J. (2023a). Lists of stopwords and AnAwords of Bosnian language (1.00) [Data set]. doi:10.5281/zenodo.8021150

Jahić, S., & Vičič, J. (2023b). Sentiment polarity lexicon of Bosnian language. Univerza na Primorskem; CERN. Retrieved from https://zenodo.org/record/7520809#.Y8-4L3bMLi0

Jahić, S., & Vičič, J. (2023c). Impact of Negation and AnA-Words on Overall Sentiment Value of the Text Written in the Bosnian Language. Applied Sciences, 13, 7760. doi:10.3390/app13137760

Jones, R. L. (2006).
An analysis of lexical text coverage in contemporary German. In Language and Computers (pp. 115–120). Leiden, The Netherlands: Brill. doi:10.1163/9789401202213_010

Jovanoski, D., Pachovski, V., & Nakov, P. (2015). Sentiment analysis in Twitter for Macedonian. Proceedings of the International Conference Recent Advances in Natural Language Processing (pp. 249–257). Hissar, Bulgaria: INCOMA Ltd. Shoumen. Retrieved from https://aclanthology.org/R15-1034

Kadunc, K. (2016). Določanje sentimenta slovenskim spletnim komentarjem s pomočjo strojnega učenja [Determining the sentiment of Slovene web comments using machine learning]. Ljubljana: Fakulteta za računalništvo in informatiko Univerze v Ljubljani. Retrieved from https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=eng&id=91182

Kapukaranov, B., & Nakov, P. (2015). Fine-grained sentiment analysis for movie reviews in Bulgarian. Proceedings of the International Conference Recent Advances in Natural Language Processing (pp. 266–274). Hissar, Bulgaria: INCOMA Ltd. Shoumen. Retrieved from https://aclanthology.org/R15-1036

Kia, D., Soujanya, P., Amir, H., Erik, C., Ahmad, H. Y., Alexander, G., & Qiang, Z. (2016). Multilingual Sentiment Analysis: State of the Art and Independent Comparison of Techniques. Cognitive Computation, 8, 757–771. doi:10.1007/s12559-016-9415-7

Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133. doi:10.1075/ijcl.6.1.05kil

Ljubešić, N., & Klubička, F. (2014). bs,hr,srWaC – web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9) (pp. 29–35). Gothenburg, Sweden: Association for Computational Linguistics. doi:10.3115/v1/W14-0405

Moreno-Ortiz, A., & Pérez-Hernández, C. (2018). Lingmotif-lex: a wide-coverage, state-of-the-art lexicon for sentiment analysis. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 2653–2659).
Miyazaki, Japan: European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1420

Okruhlica, A. (2013). Slovak sentiment lexicon induction in absence of labeled data (Master's thesis). Comenius University Bratislava.

Osmankadić, M. (2003). A Contribution to the Classification of Intensifiers in English and Bosnian. 50–62.

Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. Proceedings of the Workshop on Comparing Corpora (WCC'00), 9 (pp. 1–6). USA: Association for Computational Linguistics. doi:10.3115/1117729.1117730

Suciati, A., & Budi, I. (2020). Aspect-Based Sentiment Analysis and Emotion. (IJACSA) International Journal of Advanced Computer Science and Applications, 11(9), 179–186.

Veselovská, K. (2013). Czech subjectivity lexicon: A lexical resource for Czech polarity classification. Proceedings of the 7th International Conference Slovko (pp. 279–284). Bratislava.

Vičič, J. (2021). Bosnian news corpus 2021. Retrieved from http://hdl.handle.net/11356/1406

Wawer, A. (2012). Extracting emotive patterns for languages with rich morphology. International Journal of Computational Linguistics and Applications, 11–24.

Wu, F., Shi, Z., Dong, Z., Pand, C., & Zhang, B. (2020). Sentiment Analysis of Online Product Reviews Based On SenBERT-CNN. International Conference on Machine Learning and Cybernetics (ICMLC) (pp. 229–234). Adelaide, Australia: IEEE. doi:10.1109/ICMLC51923.2020.9469551

An Annotated Lexicon for Sentiment Analysis in the Bosnian Language

The paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented together with a usability study that focuses on language coverage. The starting base was composed by translating the Slovenian annotated lexicon and later manually checking the translations and annotations. Language coverage was examined using two reference corpora. The Bosnian language is still considered a low-resource language. A reference corpus composed of automatically crawled web pages is available for Bosnian, but the authors found that no corpus with a clear time frame for the texts it contains was available. A corpus of contemporary texts was therefore constructed by collecting news articles from several Bosnian web portals. Two language coverage methods were used in the study. The first used a frequency list of all words extracted from the two reference corpora of the Bosnian language, while the second ignored frequencies as the main factor in counting. The coverage computed with the first method was 19.24% for the first corpus and 28.05% for the second. The second method yielded 2.34% coverage for the first corpus and 6.98% for the second. The results of the study show a language coverage comparable to known methods in the field. The usability of the lexicon was already demonstrated in a Twitter-based comparison.

Keywords: Bosnian lexicon, corpus, sentiment analysis, affirmative and non-affirmative words (AnAwords), stopwords, log-likelihood, annotation