https://doi.org/10.31449/inf.v47i2.3788 Informatica 47 (2023) 143–150 143
Automatic Detection of Stop Words for Texts in the Uzbek Language
Khabibulla Madatov¹, Shukurla Bekchanov¹, Jernej Vičič²,³
¹ Urgench state university, 14, Kh. Alimdjan str, Urgench city, 220100, Uzbekistan
² University of Primorska, UP FAMNIT, E-mail: jernej.vicic@upr.si
³ Research Centre of the Slovenian Academy of Sciences and Arts, The Fran Ramovš Institute
E-mail: habi1972@mail.ru, shukurla15@gmail.com, jernej.vicic@upr.si
Keywords: stop word detection, Uzbek language, agglutinative language, algorithm
Received:
Stop words are very important for information retrieval and text analysis. This study aimed
to automatically analyze and detect stop words in texts in the Uzbek language. Because methods
for the automatic search of stop words in Uzbek texts are scarce, we analyzed a newly prepared
corpus. The Uzbek language belongs to the family of agglutinative languages. As with all agglutinative
languages, the detection of stop words in Uzbek texts is a more complex process than
in inflected languages: in inflected languages, words such as auxiliary words, articles and prepositions can
be included in the stop words group, whereas in agglutinative languages the meanings of such words are hidden
in the text. Therefore, it is not appropriate to apply all known methods of stop word detection for inflected
languages directly to agglutinative languages. In this work, the “School corpus”, which contains 731 156
Uzbek words, has been investigated. The bigram method of analysis was applied to the corpus. We proposed
the collocation method of detecting stop words of the corpus. We proposed a method of automatically
detecting stop words in Uzbek texts. It is shown that the collocation method is 6 times better than the
bigram method.
Povzetek (Summary): An automatic analysis and detection of special words in the Uzbek language was developed.
1 Introduction
The Uzbek language belongs to the Eastern Turkic, or Karluk, branch of the Turkic language family. External influences include Arabic, Persian and Russian. It belongs to the family of agglutinative languages. As with all agglutinative languages, detection of stop words in Uzbek texts is a more complex process than in inflected languages: in inflected languages, words such as auxiliary words, articles and prepositions form most of the stop words group. In agglutinative languages, the meanings of such words are hidden in the text. Therefore, it is not suitable to apply all known methods of stop word detection for inflected languages directly to agglutinative languages. The experimental results presented in this work show that the use of a hybrid method (combining grammatical rules and statistical methods) yields the best results in the task of detecting stop words for texts in Uzbek. We compare this method with the bigram method (both methods are thoroughly presented in the paper).
When someone works on a novel, a story, an article, or a text, this person uses the semantic connection of words, with artistic decoration in their own language, to make it meaningful and interesting. In sentences, stop words, which have no independent meaning or little meaning, are often used. As a result, the text size increases. As the volume of information increases, data processing and analysis slow down, and as the search space increases, the quality of (search) results is potentially lowered. In such cases, removing unnecessary words from the text can reduce the amount of information and increase the efficiency of electronic data processing. It is also important to automate the generation of annotations and keywords from large volumes of text. The main purpose of identifying unimportant words is to facilitate automatic text analysis.
2 Related works
Stop words are used in a number of tasks involving language technologies, such as text generation [1], and Turkic languages are no exception: an example of applying stop words to sentiment discovery in Turkish language comments is presented in [2]. Stop word detection methods can be divided into two basic categories:
1. Based on grammar rules,
2. Statistical methods.
In this work, we use both categories for automatic detec-
tion of stop words in Uzbek texts.
2.1 Based on grammar rules
The sources mainly provide grammatical rules for finding
stop words or a list of stop words for different languages [3],
[4], [5], [6], [7], [8], [9], [10], [11]. To identify stop words in Uzbek texts, the text is analyzed grammatically. According to the definition of stop words, words in the Uzbek language that are part of a rhyme, conjunctions, introductory words, adverbs and auxiliary words can be stop words. They need to be separated automatically from the given text. Because syntactic analysis programs for the Uzbek language are lacking, a list of words that are presumed to be stop words is compiled using a dictionary. To create this list from the dictionary, we take the definition of stop words into account. In general, pronouns, adverbs, connectors and introductory words can be stop words in Uzbek texts.
2.1.1 Pronouns
A pronoun is a part of speech used instead of a noun, adjective or numeral. The meaning of a pronoun, and the word or words it substitutes, is defined by the context (within or across sentences). According to meaning and grammatical features, pronouns are divided into generalized-subject pronouns (pronouns-nouns: men (I), sen (you), u (he, she, it), kim (who), nima (what), hechkim (nobody), hechnima (nothing)), generalized-nominal pronouns (pronouns-adjectives: bu (this), shu (this), o’sha (that), qaysi (which), allaqanday (somehow), hechqanday (no)) and generalized-quantitative pronouns (pronouns-numerals: qancha (how much), necha (how many), shuncha (so many), o’shancha (so many)). Pronouns differ from other parts of speech in their polysemy and lack of word formation. By meaning and grammatical features, pronouns are divided into the following types:
– personal pronouns, used instead of persons: men (I), sen (you), u (he, she, it), biz (we), ular (they);
– the proper pronoun, which consists of a proper word denoting an object, strengthening its meaning and emphasizing it;
– indicative pronouns, which indicate an object and its signs: bu (this), shu (this), o’sha (that), u (he, she, it), ana (that), etc.;
– interrogative pronouns, which express questions about the subject, attribute and quantity: kim? (who), nima? (what), qancha? (how many);
– definitive-collective pronouns, which indicate the generalization of the subject and its features: hamma (all), bari (all), ba’zan (sometimes), har nima (anything), har qanday (every/any);
– indefinite pronouns, which express the denial of meaning: hechkim (no one), hechqanday (no), hechqanaqa (no), hechqaysi (none).
2.1.2 Adverbs
An adverb is an independent part of speech that denotes a sign of an action or a state, as well as a sign of a sign. Adverbs have the following types of meaning: state adverbs (tez (quick), sekin (slow), piyoda (on foot)); adverbs of place (uzoqda (far), yaqinda (near), pastda (below)); adverbs of time (hozir (now), kecha (yesterday), bugun (today)); adverbs of quantity (ancha (much), sal (little), kam (few)); adverbs of purpose (ataylab (deliberately), jo’rttaga (willingly)); adverbs of reason (noiloj (helplessly), ilojsiz (helplessly), chorasizlikdan (helplessly)). All forms, except for the forms of time, place and purpose, can by their most general characteristics be attributed to one type and called status forms. As an independent part of speech, an adverb is characterized by the following morphological features:
– has a category of degree: tez (quick), ko’p (much) (oddiy daraja, simple degree); tezroq (quicker), ko’proq (more) (comparative degree); eng tez (the quickest), juda ko’p (much more) (superlative degree);
– remains unchanged and is often associated with verbs: So’rida qat-qat duxoba ko’rpachalar ustma-ust to’shalgan edi (The couch was covered with layers of velvet mattress);
– an adverb can also be associated with an adjective or a noun in some places. In such cases, the adverb does not indicate a sign of a sign or a sign of an object, but refers to the adjective to which it is attached, or to a sign of action understood from a noun: Kecha havo juda sovuq edi. (It was very cold last night). U hozir beqiyos va tasavvur qilib bo’lmas baxtiyor edi. (Now, he was incomparably, unimaginably happy);
– an adverb has suffixes: -cha, -ona, -larcha, -lab, etc.;
– adverbs are formed in morphological and syntactic ways (cases of nominalization are an exception), for each time, including the moment, as in the moment (syntactic method).
By structure, adverbs are divided into simple (kamtarona (modest), vijdonan (conscientious), butunlay (whole)), compound (har dam (always), bir yo’la (together), oz muncha (much), har qachon (always)), paired (kecha-kunduz (day and night), qishin-yozin (winter and summer)) and repeated (oz-oz (little by little), tez-tez (often), ko’p-ko’p (many-many)). The modal form is considered to include such forms as anchagina (much), juda (very), kam (little), kam-kam (little by little). Adverbs act in a sentence as case, determinant and cut. Adverbs are similar to adjectives in expressing a feature, but differ from them in grammatical properties: adjectives denote a feature of an object, whereas adverbs denote a sign of an action or state; their syntactic function in the sentence is also distinct.
2.1.3 Connectors
Connectors are auxiliary words that serve to link organized parts of a sentence and simple sentences in the structure of a combined sentence. In other words, auxiliary words that connect two or more fragments of a sentence, or two or more sentences, are called connectors.
Introductory words. Introductory words are words that are
not syntactically related to the sentence. They express the
speaker’s attitude to the expressed thought (”baxtimga”
(fortunately), ”afsuski” (unfortunately)), the general assessment of the thought (”ehtimol”, ”albatta”), to whom it belongs (”menimcha” (to my mind), ”aytishlaricha” (it is said)) or its connection with the preceding thought (”xullas” (so), ”nihoyat” (finally)). Words used in a sentence in the function of an introductory word, expressing the speaker’s attitude to the expressed thought, are called modal words. Modal words are not independent words denoting a thing, sign, action, etc., and cannot be part of a sentence. Therefore, they are not syntactically related to the fragments of the sentence: “Demak, ishlasa bo‘ladi. Ehtimol, ketmon bilan yer ag‘darishga ham to‘g‘ri kelar.” (So, it can endure to work. Perhaps, you may even have to roll over with a hoe).
2.1.4 Usage
The rules described in the previous subsections were used to detect stop words. They were applied to the popular explanatory dictionary of the Uzbek language [3], which contains 80000 words, and approximately 1100 stop words were detected. These are by definition one-word stop words, as they come from the dictionary.
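The dictionary-based step can be illustrated with a minimal sketch. It assumes a hypothetical dictionary given as (word, part of speech) pairs; the tag names, the `STOP_WORD_CLASSES` set and the function name are illustrative assumptions and do not describe the format of the dictionary in [3].

```python
# Hypothetical part-of-speech tags for the classes named in this section.
STOP_WORD_CLASSES = {"pronoun", "adverb", "connector", "introductory"}

def dictionary_stop_words(entries):
    """Return the words whose part of speech is a stop word candidate class."""
    return sorted(word for word, pos in entries if pos in STOP_WORD_CLASSES)

# Toy entries; the real dictionary is assumed to be far larger.
entries = [
    ("men", "pronoun"),
    ("tez", "adverb"),
    ("afsuski", "introductory"),
    ("kitob", "noun"),
    ("va", "connector"),
]
candidates = dictionary_stop_words(entries)
print(candidates)
```

Every word found this way is a single-word candidate, which matches the remark that dictionary lookups can only yield one-word stop words.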
2.1.5 Statistical method
Consider the statistical method of automatic detection of stop words in Uzbek texts. In this method, stop words are found based on the term frequency and the inverse document frequency of a word (Term Frequency – Inverse Document Frequency, TF-IDF) [12]. Term Frequency (TF) is the number of times a word occurs in a text. Inverse Document Frequency (IDF) is determined by the number of texts (documents) being viewed and the number of those texts (documents) in which the given word is present. TF-IDF is one of the popular methods of knowledge discovery. Some words are very common in the text but almost insignificant in terms of meaning; conversely, some words are rare in the text but very important for its meaning. In order to increase the impact of meaningful words and decrease the impact of words that add little to the meaning, we multiply TF by IDF. The statistical method is used as the basis for finding stop words in many languages; the sources mainly use the TF-IDF method to analyze the words of a text [4], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Several methods to find stop words for Turkish are given in [18]. Comparing the current work with these sources, we present the scientific novelty of the article.
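To make the TF-IDF idea concrete, here is a minimal sketch using one common textbook formulation: TF as the relative frequency of a term in a document, and IDF as the natural logarithm of the number of documents divided by the number of documents containing the term. The function name and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy documents (invented for this sketch).
docs = [
    "bu kitob juda yaxshi".split(),
    "bu havo juda sovuq".split(),
    "bu shahar katta".split(),
]
doc_weights = tf_idf(docs)
print(doc_weights[0])
```

A term that occurs in every document, such as bu in the toy data, gets weight zero, which is the property that makes low TF-IDF weights a signal for stop word candidates.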
3 Methodology
Scientific novelty of the article. First, a collocation method is proposed for automatically finding stop words in an Uzbek corpus consisting of 731 156 words, and its advantage is shown by comparison with the bigram method. Second, a stop word detection algorithm is proposed for Uzbek texts.
This section is dedicated to the method of automatic detection of stop words in Uzbek texts. The following procedure was used for the detection of stop words.
3.1 Corpus
A corpus named “School corpus” was created using freely available school books such as “Reading book”, “Mother tongue” and “Literature”. The texts were downloaded from Eduportal¹. The total number of documents is 25. The motivation behind the selection of the texts for the corpus was the following:
– everyone enriches their personal vocabulary during the school period,
– free availability of the texts,
– school textbooks are thoroughly checked for errors,
– a marginally sufficient selection of documents and length of the documents (taking into account the low availability of Uzbek texts in digital form in general).
Some basic data about the corpus:
– name: School corpus,
– total number of words: 731 156,
– number of unique words: 47165.
The investigation of stop words in the Uzbek language has shown that in most stop words that are collocations, each word is not a stop word when viewed individually, but the words become stop words when considered together as a collocation. Examples 3.1 and 3.2, in which the meaning of the sentences is the same, illustrate this claim. Viewed individually, the words bir and martalik are not stop words, but observed as a collocation they become a stop word.
Example 3.1. Xalqimiz bir martalik shprits vositasida emlanadi. – (Our people are vaccinated with a disposable syringe.)
Example 3.2. Xalqimiz shprits vositasida emlanadi. – (Our people are vaccinated with syringes.)
Thus, the problem of finding stop words needs to be expanded beyond stop words that consist of one word. A collocation consists of 2 or more words. Only two-word collocations are considered in this article; the motivation is that collocations of three or more words that act as stop words are not that common, although we believe that further work needs to be done in this direction. The proposed methodology does not change for longer collocations.
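The notion of a collocation as a run of consecutive words can be sketched as follows, using the sentence from Example 3.1. The function name and the parameter `n` are illustrative assumptions; lowercasing and whitespace tokenization are simplifications rather than the authors' preprocessing.

```python
def collocations(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Xalqimiz bir martalik shprits vositasida emlanadi".lower().split()
pairs = collocations(tokens)        # two-word collocations
triples = collocations(tokens, 3)   # the same idea for longer collocations
print(len(tokens), len(pairs), len(triples))
print(pairs)
```

A text of N tokens yields N - 1 two-word collocations, and only the window size changes when longer collocations are considered.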
¹ Eduportal: https://eduportal.uz/Eduportal/Barchasi/33
3.2 Bigram method
For the purpose of this article, the following definition will be used: bigrams are pairs of consecutive words appearing in the text. Let us consider the use of the bigram method for finding stop words in the corpus. Algorithm 1 presents the implementation of the bigram method.
Algorithm 1: The bigram method
1. Consider all occurrences $(a_i, a_{i+1})$ of collocation words in a corpus. Construct a list of unique pairs $UP1$. In our corpus example, the number of such collocations was 731 155. Among them were 489857 unique pairs.
2. Consider the list $UP1$: for each pair $(a_i, a_{i+1})$ from the list take $a_i$ and find the next word $a'_{i+1}$ with the biggest bigram probability in the corpus. There were 90959 unique pairs $(a_i, a'_{i+1})$, forming the list $UP2$.
3. Calculate the term frequency (TF) of the unique pairs in $UP2$ for each document in the corpus. In our example corpus that meant for each of the 25 documents. We denote it as $D_jTF(a_i, a'_{i+1})$, $j = 1, \dots, 25$.
4. $D_jTF(a_i, a'_{i+1}) = k_j / h_j$, where $h_j$ is the number of occurrences of the pair words in the document $j$ and $k_j$ is the number of unique pairs in document $j$.
5. $IDF(a_i, a'_{i+1}) = \ln(n/m)$, $n = 25$, where $m$ is the number of documents which include the unique pair $(a_i, a'_{i+1})$, in our example among 25 documents.
6. $W_{ij}(a_i, a'_{i+1}) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i, a'_{i+1}) \cdot D_jTF(a_i, a'_{i+1})$
7. $W_{ij}(a_i, a'_{i+1})$ denotes the weights of unique pairs.
8. We took 5 % of the 90957 unique pairs, those whose weight $W_{ij}(a_i, a'_{i+1})$ is close to zero, and declared them stop words.
Algorithm 1 applied to the ”School corpus” produced 4548 pairs of words as collocation stop words. A few examples are presented in Figure 1.
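Step 2 of Algorithm 1, which keeps only the most probable successor of each first word, might be sketched as below. This is an illustration under assumptions: the function name and toy tokens are invented, and ties are broken by first occurrence because the text does not say how ties were handled in the original experiment.

```python
from collections import Counter, defaultdict

def most_probable_successors(tokens):
    """Map each first word to its most frequent following word."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    # For a fixed first word, P(second | first) is proportional to the pair
    # count, so the most frequent successor is also the most probable one.
    return {first: counts.most_common(1)[0][0]
            for first, counts in followers.items()}

# Toy token stream (invented for this sketch).
tokens = "har bir kitob har bir bola har kuni".split()
best = most_probable_successors(tokens)
print(best)
```

The resulting pairs play the role of the reduced list $UP2$; the weighting in steps 3 to 7 is then applied to these pairs only.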
3.3 The collocation method’s algorithm
The following definition of collocation will be used throughout the article: an occurrence of consecutive words in a corpus. A (two-word) collocation and a bigram represent essentially the same starting set of word pairs, but bigrams are limited to the most probable pair, whereas collocations take all pairs into consideration.
A collocation may consist of 2 or more words. The presented method and derived results are limited to two-word collocations for the sake of simplicity, but the method can be generalized to any length. The method should be used before the single stop word detection method.
To find collocation stop words, we use Algorithm 2.
Algorithm 2 applied to the ”School corpus” produced 24490 pairs of words as collocation stop words. A few examples are presented in Figure 2.
1. chop etildi(published)
2. har bir(each)
3. kitob jamgarmasi(book fund)
4. nima uchun(what for)
5. o’rta talim(secondary education)
6. men ham(me too)
7. bilan birga(along with)
8. yaxshi muqova(good cover)
9. oz vaqtida(It's on time)
10. ham bir(also a)
11. bir necha(a few)
12. barcha varaqlari(all sheets)
13. o’zi ham(himself)
14. bu yerda(here)
15. bo’lib qoldi(has become)
16. u ham(he too)
17. uchun ham(for both)
18. uning bu(its this)
19. butun darslikning(of the whole textbook)
20. yangi darslikning(new textbook)
…
4529. Velosiped baxtiga(Luckily for the bike)
4530. Vodiy daralariga(To the gorges of the valley)
4531. Voqealarga aralashadi(Interferes with events)
4532. Xarakteri amallari(Character actions)
4533. Xarakterini izohlang(Explain the character)
4534. xonimning uylariga(to the lady's house)
4535. xoqonning hayoti(the life of a hawk)
4536. xotirasini abadiylashtirish(immortalize the memory)
4537. xudoyor davron(godly era)
4538. xushxabar ammo(The good news, however)
4539. yapon arab(Japanese arab)
4540. yasagan qayiqlarni(made boats)
4541. yasalgan fe’llar(made verbs)
4542. yaxshilar ahbob(good fellow)
4543. yig’isi alomatning(crying symptom)
4544. yig’lagan bolasini(crying baby)
4545. yig’och chog’liq(wood chips)
4546. yodlang islom(Remember Islam)
4547. yo’lakda bir(one in the hallway)
4548. yo’llardan biri(one of the ways)
Figure 1: Examples selected from the list of all stopwords generated by the bigram Algorithm 1.
3.4 Single word (stop word) detection algorithm
In this section we consider the algorithm for detecting single-word stop words, based on the TF-IDF (term frequency and inverse document frequency) of the word. To find single-word stop words, we use Algorithm 3.
Algorithm 3 applied to the ”School corpus” produced 2358 stop words. A few examples are presented in Figure 4.
3.5 The final stop word detection algorithm for Uzbek language
In this section we consider the main algorithm for detecting stop words in Uzbek texts. The algorithm is shown in the scheme presented in Figure 3.
Algorithm 2: The collocation method
1. Consider all occurrences of collocations in a corpus. In our case the total number of such collocations was 731 155. Among them, 489857 are unique collocations.
2. $D_jTF(a_i, a_{i+1}) = k_j / h_j$, where $h_j$ is the number of occurrences of the pair words in the document $j$ and $k_j$ is the number of unique pairs in document $j$.
3. $IDF(a_i, a_{i+1}) = \ln(n/m)$, $n = 25$, where $m$ is the number of documents which include the unique pair, in our example among 25 documents.
4. $W_{ij}(a_i, a_{i+1}) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i, a_{i+1}) \cdot D_jTF(a_i, a_{i+1})$
5. $W_{ij}(a_i, a_{i+1})$ denotes the weight of a collocation $(a_i, a_{i+1})$.
6. 5 % of all unique collocations had a weight $W_{ij}(a_i, a_{i+1})$ close to zero and were declared stop words.
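The weighting in steps 2 to 5 of Algorithm 2 can be sketched as follows. One assumption is made loudly here: the term frequency of a pair in a document is taken as its number of occurrences divided by the total number of pairs in that document, which is one reading of $k_j / h_j$; the wording in the algorithm leaves this open. The function name and toy documents are illustrative.

```python
import math
from collections import Counter

def collocation_weights(documents):
    """Average TF-IDF weight of each two-word collocation over all documents."""
    n = len(documents)
    # Count consecutive word pairs separately for every document.
    pair_counts = [Counter(zip(doc, doc[1:])) for doc in documents]
    totals = [sum(counts.values()) for counts in pair_counts]
    # m: number of documents that contain the pair at least once.
    doc_freq = Counter()
    for counts in pair_counts:
        doc_freq.update(counts.keys())
    weights = {}
    for pair, m in doc_freq.items():
        idf = math.log(n / m)
        # Assumed TF: occurrences of the pair / all pairs in the document.
        tf_sum = sum(c[pair] / t for c, t in zip(pair_counts, totals) if t > 0)
        weights[pair] = idf * tf_sum / n
    return weights

# Toy documents (invented for this sketch).
documents = [
    "har bir kitob".split(),
    "har bir bola".split(),
    "har bir kuni".split(),
]
weights = collocation_weights(documents)
print(weights[("har", "bir")], weights[("bir", "kitob")])
```

In the toy data the pair har bir occurs in every document and therefore receives weight zero, while pairs confined to one document receive a positive weight.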
1. har bir(each)
2. nima uchun(what for)
3. bir kuni(one day)
4. o’rta talim(secondary education)
5. uchun darslik(textbook for)
6. chop etildi(published)
7. kitob jamg’armasi(book fund)
8. abad ham(forever)
9. abadiy kuchidan(from eternal power)
10. abadiy manziliga(to the eternal address)
11. abadiy muhrlanib(sealed forever)
12. abadligi hamda(eternity and)
13. Abadul abad badnom(Abadul abad badnom)
14. Abadul abad turajakdur(It will last forever)
15. Abay singari(Abay suchlike)
16. Abbos degan(Abbos named)
17. Abbos qilichi(The sword of Abbas)
18. Abdulaziz qaytib(Abdulaziz returned)
19. Abdulazizga qaradi(He looked at Abdulaziz)
…
24471. Odamlarni ko’rishadi(They see people)
24472. Odamlarning chehralari(Faces of people)
24473. Odamlarning haqiga(About people)
24474. Odamlarning kamligi(Lack of people)
24475. Odamlarning ko’zidan(From people's eyes)
24476. Odamlarning ko’zini(People's eyes)
24477. Odamlarning nomlarini(The names of the people)
24478. odamlarning og’irini(the weight of people)
24479. odamlarning qaysi(which of the people)
24480. odamlarning va(people and)
24481. odamlarning zilzila(earthquake of people)
24482. odamligi uni(humanity him)
24483. odamligini ham(that he is human)
24484. odamligini ta’minlab(providing humanity)
24485. odamlik qiyofasini(human image)
24486. odamman deb(that I am human)
24487. odamman deganini(I mean man)
24488. odamma saxir(I'm sorry)
24489. odamni ajdodlari(man's ancestors)
24490. odamni ona(mother of man)
Figure 2: Examples selected from the list of all stopwords generated by the collocation Algorithm 2.
Algorithm 3: Single word (stop word) detection algorithm
1. $D_jTF(a_i) = k_j / h_j$, where $h_j$ is the number of occurrences of the word in the document $j$ and $k_j$ is the number of unique words in document $j$.
2. $IDF(a_i) = \ln(n/m)$, $n = 25$, where $m$ is the number of documents which include the word $a_i$, in our example among 25 documents.
3. $W_{ij}(a_i) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i) \cdot D_jTF(a_i)$
4. $W_{ij}(a_i)$ denotes the weight of a word $a_i$.
5. 5 % of the 47165 unique words had a weight $W_{ij}(a_i)$ close to zero and were declared stop words.
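The final selection step, shared by Algorithms 1 to 3, can be read as keeping the 5 % of items with the smallest weights. The sketch below implements that reading; treating "close to zero" as "lowest 5 % by weight" is an assumption, as are the function name and the toy weights.

```python
def lowest_weight_items(weights, fraction=0.05):
    """Return the `fraction` of items with the smallest weights, lowest first."""
    cutoff = round(len(weights) * fraction)
    return sorted(weights, key=weights.get)[:cutoff]

# Toy weights for 40 hypothetical words; 5 % of 40 is 2.
toy_weights = {f"w{i}": i / 100 for i in range(40)}
stop_words = lowest_weight_items(toy_weights)
print(stop_words)
```

Because TF-IDF weights are non-negative, the items with the smallest weights are also the ones closest to zero.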
Algorithm 4: find and remove Uzbek stop words from text (Corpus)
Input(Corpus)
Corpus ← Tokenize(Corpus)
Dictionary ← Extract_From_Dictionary(pronoun, modal verb, particle, part of a rhyme, conjunctions, introductory words, adverbs, auxiliary words)
/* Procedure Check(Corpus) */
i ← 1
while i < len(Corpus) do
    if Corpus(i) ∈ Dictionary then
        Corpus ← Corpus − Corpus(i)
    i ← i+1
/* Procedure Collocation_Two_Words(Corpus) */
Corpus ← Tokenize(Corpus)
i ← 1
while i < len(Corpus) do
    S(i) ← token(i)+token(i+1)
    i ← i+1
/* Procedure IDF() */
IDF(S(i)) ← ln(N/n)    // N - number of all documents; n - number of documents which include S(i)
/* Procedure TFIDF() */
j ← 1
while j < len(Corpus) do
    TF(j) ← 0
    i ← j
    while i < len(Corpus) − 1 do
        if S(j) == S(i) then
            TF(j) ← TF(j) + 1
        i ← i+1
    TFIDF(j) ← TF(j)*IDF(S(j))
    if TFIDF(j) close to zero then
        Dictionary(j) ← S(j)
        i ← 1
        while i < len(Corpus) do
            if Dictionary(j) == Corpus(i) then
                Corpus ← Corpus − Dictionary(j)
            i ← i+1
    j ← j+1
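The removal stage of Algorithm 4 can be approximated with a short sketch that takes the stop word lists as given. The function name and its parameters are hypothetical. Collocations are removed before single words, following the remark in Section 3.3; Algorithm 4 as printed runs the dictionary check first, so the order used here is an assumption.

```python
def remove_stop_words(tokens, single_stop_words, collocation_stop_words):
    """Drop two-word collocation stop words first, then single stop words."""
    kept = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocation_stop_words:
            i += 2  # skip both words of the collocation
            continue
        kept.append(tokens[i])
        i += 1
    return [token for token in kept if token not in single_stop_words]

tokens = "xalqimiz bir martalik shprits vositasida emlanadi".split()
cleaned = remove_stop_words(
    tokens,
    single_stop_words={"ham"},
    collocation_stop_words={("bir", "martalik")},
)
print(" ".join(cleaned))
```

Applied to the sentence of Example 3.1 with bir martalik as a collocation stop word, the sketch returns the sentence of Example 3.2.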
Figure 3: Scheme of the whole process.
1. Abdulla(Abdulla)
2. aka(brother)
3. asosida(based on)
4. ayt(say)
5. aytib(telling)
6. aziz(dear)
7. baho(evaluation)
8. bahor(spring)
9. baland(high)
10. beradi(will give)
11. berdi(gave)
12. berib(giving)
13. berilgan(given)
14. bering(read)
15. bichimi(physique)
…………………
2344. badiiyatni(art)
2345. bag’ayri(past)
2346. bag’rimdami(in my heart)
2347. baid(height)
2348. balladalar(ballads)
2349. banddin(occupied)
2350. Bandksoy (Bandkushoy)
2351. barchalarining(all of them)
2352. barglarga(to the leaves)
2353. bastai(composer)
2354. baxilga(stingy)
2355. baxtdan(happily)
2356. baytallarga(beetles)
2357. bazmni(party)
2358. begonani(outsider)
Figure 4: Examples selected from the list of all stopwords
generated by the single word extraction Algorithm 3.
Table 1: Number of stop words created by each presented algorithm.
Algorithm      Number of stop words
Bigram         4548
Collocation    24490
Single word    2358
4 Results
The first phase of the project consisted of creating a solid base for corpus linguistics, as there were no readily available corpora for the Uzbek language. A corpus named ”School corpus” was created with 731 156 running words. The algorithms for stop word detection were applied to the aforementioned corpus, and the number of stop words produced by each algorithm is presented in Table 1.
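As a quick arithmetic cross-check, the sketch below recomputes 5 % of the unique item counts quoted in Algorithms 1 to 3 and the ratio between the collocation and bigram list sizes from Table 1. The variable names are illustrative; the numbers are the ones printed in the text.

```python
# Unique items each algorithm selects from, as quoted in Algorithms 1 to 3.
unique_items = {"bigram": 90957, "collocation": 489857, "single word": 47165}
# Stop word counts reported in Table 1.
reported = {"bigram": 4548, "collocation": 24490, "single word": 2358}

five_percent = {name: round(0.05 * total) for name, total in unique_items.items()}
for name in unique_items:
    print(name, five_percent[name], reported[name])

ratio = reported["collocation"] / reported["bigram"]
print(round(ratio, 2))
```

The 5 % figures land on or near the reported counts; the small difference for the collocation list may stem from how the cutoff was applied, which the text does not specify. The list sizes give a ratio of roughly 5.4 between the collocation and bigram methods.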
5 Data availability
The presented automatically extracted lists (a list for each described method) are freely available at the Zenodo repository [23]: https://doi.org/10.5281/zenodo.6319953
6 Conclusion and further work
Figure 5: Number of stop words for each algorithm applied to the ”School corpus”.
The article presents the first attempt at the automatic detection of stop words for the Uzbek language. A corpus named ”School corpus” was created for this purpose; it contains 25 documents and 731 156 running words, of which 47165 are unique words. Three methods were applied to the corpus in order to extract (or detect) stop words: a method that extracts single-word stop words and two methods that aim at pairs of words, a bigram method and a collocation method. Each method is described and presented in the form of an algorithm. The methods can be used in series and the results can be added together to form the final list of stop words. Under this approach (based on TF-IDF), the notion of a stop word depends on the text, so any word can be a stop word. A quick comparison of the methods shows an increase in stop word detection using the collocation method.
This research is believed to support other works in Uzbek, not only in the field of automatic stop word detection, but also other related NLP areas [24], such as Uzbek WordNet [25], opinion mining [26], or semantic analysis [27].
References
[1] P. Tomašič, G. Papa, and M. Žnidaršič, “Using a genetic algorithm to produce slogans,” Informatica, vol. 39, no. 2, 2015.
[2] R. Yayla and T. T. Bilgin, “Determining of the user attitudes on mobile security programs with machine learning methods,” Informatica (Slovenia), 2021. [Online]. Available: https://doi.org/10.31449/inf.v45i3.3506
[3] S. Matlatipov, X. Madatov, G. Matlatipov, A. O‘razbayev, M. Raximboyev, I. Avezmatov, U. Babajanov, L. Kurbanova, D. Xujamov, and D. Matjumayeva, “”o‘zbek tilining statistik electron lug‘at” exm dasturi uchun guvohnoma,” Intellektual mulk agentligi, 2020.
[4] A. W. Pradana and M. Hayaty, “The effect of stemming and removal of stop words on the accuracy of sentiment analysis on indonesian-language texts,” Game Technology, Information System, Computer Network, Computing, Electronics, and Control Journal, vol. 4, no. 3, pp. 277–288, 2019. [Online]. Available: https://doi.org/10.22219/kinetik.v4i4.912
[5] R. U. Haque, P. Mehera, M. F. Mridha, and M. A. Hamid, “A complete bengali stop word detection mechanism,” in Conference Paper ∙ May 2019. Conference, 2019. [Online]. Available: https://doi.org/10.1109/ICIEV.2019.8858544
[6] R. Rania and D. K. Lobiyal, “Automatic construction of generic stop words list for hindi text,” in International Conference on Computational Intelligence and Data Science, vol. 132, International Conference on Computational Intelligence and Data Science. ICCIDS 2018, 2018, pp. 362–370.
[7] P. J. Burns, “Constructing stoplists for historical languages,” Digital Classics Online, vol. 4, no. 2, 2018. [Online]. Available: https://doi.org/10.11588/dco.2018.2.52124
[8] R. M. Rakholia and J. R. Saini, “A rule-based approach to identify stop words for gujarati language,” in Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, 2017, pp. 797–806.
[9] J. K. Raulji and J. R. Saini, “Generating stopword list for sanskrit language,” in 2017 IEEE 7th International Advance Computing Conference. IEEE, 2017, pp. 799–802.
[10] O. D. Tijani, A. T. Akinwale, S. A. Onashoga, and E. O. Adeleke, “An auto-generated approach of stop words using aggregated analysis,” in Proceedings of the 13th International Conference of the Nigeria Computer Society, 2017, pp. 99–115.
[11] M. Mhatre, D. Phondekar, P. Kadam, A. Chawathe, and K. Ghag, “Dimensionality reduction for sentiment analysis using pre-processing techniques,” in Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication. ICCMC, 2017, pp. 16–21. [Online]. Available: https://doi.org/10.1109/ICCMC.2017.8282676
[12] C. Sammut and G. I. Webb, Eds., TF–IDF. Boston, MA: Springer US, 2010, pp. 986–987. [Online]. Available: https://doi.org/10.1007/978-0-387-30164-8_832
[13] Y. Wang, K. Kim, B. Lee, and H. Y. Youn, “Word clustering based on pos feature for efficient twitter sentiment analysis,” Human-centric Comput, vol. 8, no. 17, pp. 1–25, 2019. [Online]. Available: https://doi.org/10.1186/s13673-018-0140-y
[14] N. Ousirimaneechai and S. Sinthupinyo, “Extraction of trend keywords and stop words from thai facebook pages using character n-grams,” International Journal of Machine Learning and Computing, vol. 8, no. 6, 2018.
[15] C. Slamet, A. R. Atmadja, D. S. Maylawati, R. S. Lestari, W. Dharmalaksana, and M. A. Ramdhani, “Automated text summarization for indonesian article using vector space model,” in IOP Conf. Ser. Mater. Sci. Eng., vol. 288, no. 1, Conference. IOP, 2018. [Online]. Available: https://doi.org/10.1088/1757-899X/288/1/012037
[16] G. Li and J. Li, “Research on sentiment classification for tang poetry based on tf-idf and fp-growth,” in Proceedings of 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference. IAEAC, 2018, pp. 630–634. [Online]. Available: https://doi.org/10.1109/IAEAC.2018.8577715
[17] H. M. Zin, N. Mustapha, M. A. A. Murad, and N. M. Sharef, “The effects of pre-processing strategies in sentiment analysis of online movie reviews,” in AIP Conf. Proc., vol. 1891, no. 1. AIP Conf., 2017, pp. 1–7. [Online]. Available: https://doi.org/10.1063/1.5005422
[18] S. K. Metin and B. Karaog’lan, “Stop word detection as a binary classification problem,” Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, vol. 18, no. 2, pp. 346–359, 2017. [Online]. Available: https://doi.org/10.18038/aubtda.322136
[19] J. K. Raulji and J. R. Saini, “Generating stop word list for sanskrit language,” in Advance Computing Conference IEEE 7th International. IEEE, 2017, pp. 799–802.
[20] S. J. R. Rakholia R. M., “A rule-based approach to identify stop words for gujarati language,” in Suresh Chandra Satapathy Vikrant Bhateja Siba K., 2017.
[21] R. M. Rakholia and J. R. Saini, “Information retrieval for gujarati language using cosine similarity based vector space model,” in Theory and Applications. Springer Singapore, 2017, pp. 1–9.
[22] X. Madatov and S. Matlatipov, “Kosinus o’xshahshlik va uning o’zbek tili matnlariga tatbiqi haqida,” O’zMU xabarlari, vol. 2, no. 1, 2016.
[23] K. Madatov, S. Bekchanov, and J. Vičič, “Lists of uzbek stopwords (1.1) [data set],” Zenodo. [Online]. Available: https://doi.org/10.5281/zenodo.6319953
[24] K. Madatov, S. Bekchanov, and J. Vičič, “Dataset of stopwords extracted from uzbek texts,” Data in Brief, vol. 43, p. 108351, 2022.
[25] K. A. Madatov, D. Khujamov, and B. Boltayev, “Creating of the uzbek wordnet based on turkish wordnet,” in AIP Conference Proceedings, vol. 2432, no. 1. AIP Publishing LLC, 2022, p. 060009.
[26] S. Matlatipov, H. Rahimboeva, J. Rajabov, and E. Kuriyozov, “Uzbek sentiment analysis based on local restaurant reviews,” arXiv preprint arXiv:2205.15930, 2022.
[27] U. Salaev, E. Kuriyozov, and C. Gómez-Rodríguez, “Simreluz: Similarity and relatedness scores as a semantic evaluation dataset for uzbek language,” arXiv preprint arXiv:2205.06072, 2022.