https://doi.org/10.31449/inf.v47i2.3788
Informatica 47 (2023) 143–150

Automatic Detection of Stop Words for Texts in the Uzbek Language

Khabibulla Madatov 1, Shukurla Bekchanov 1, Jernej Vičič 2, 3
1 Urgench State University, 14, Kh. Alimdjan str, Urgench city, 220100, Uzbekistan
2 University of Primorska, UP FAMNIT
3 Research Centre of the Slovenian Academy of Sciences and Arts, The Fran Ramovš Institute
E-mail: habi1972@mail.ru, shukurla15@gmail.com, jernej.vicic@upr.si

Keywords: stop word detection, Uzbek language, agglutinative language, algorithm

Stop words are very important for information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in texts in the Uzbek language. Because few methods are available for the automatic search of stop words in Uzbek texts, we analyzed a newly prepared corpus. The Uzbek language belongs to the family of agglutinative languages. As in all agglutinative languages, the detection of stop words in Uzbek texts is a more complex process than in inflected languages: in inflected languages, auxiliary words, articles, and prepositions can be included in the stop word group, whereas in agglutinative languages the meanings of such words are hidden in the text. Therefore, it is not appropriate to apply all known methods of stop word detection for inflected languages directly to agglutinative languages. In this work, the “School corpus”, which contains 731 156 Uzbek words, was investigated. The bigram method of analysis was applied to the corpus. We propose a collocation method for detecting stop words in the corpus, together with a method for automatically detecting stop words in Uzbek texts. It is shown that the collocation method is 6 times better than the bigram method.
Povzetek (Summary): An automatic analysis and detection of special words in the Uzbek language has been developed.

1 Introduction

The Uzbek language belongs to the Eastern Turkic or Karluk branch of the Turkic language family. External influences include Arabic, Persian and Russian. It belongs to the family of agglutinative languages. As in all agglutinative languages, the detection of stop words in Uzbek texts is a more complex process than in inflected languages: in inflected languages, auxiliary words, articles, and prepositions form most of the stop word group, whereas in agglutinative languages the meanings of such words are hidden in the text. Therefore, it is not suitable to apply all known methods of stop word detection for inflected languages directly to agglutinative languages. The experimental results presented in this work show that a hybrid method (combining grammatical rules and statistical methods) yields the best results in the task of detecting stop words for texts in Uzbek. We compare this method with the bigram method (both methods are thoroughly presented in the paper).

When someone works on a novel, a story, an article, or any other text, they use semantic connections between words, with artistic decoration in their own language, to make the text meaningful and interesting. In forming sentences, stop words, which have little or no independent meaning, are often used. As a result, the text size increases. As the volume of information increases, data processing and analysis slow down, and as the search space grows, the quality of search results is potentially lowered. In such cases, removing unnecessary words from the text can reduce the amount of information and increase the efficiency of electronic data processing. It is also important to automate the generation of annotations and keywords from large volumes of text.
The main purpose of identifying unimportant words is to facilitate automatic text analysis.

2 Related works

Stop words are used in a number of tasks involving language technologies, such as text generation [1]. Turkic languages are no exception: an example of applying stop words to sentiment discovery in Turkish-language comments is presented in [2]. Stop word detection methods can be divided into two basic categories:
1. Based on grammar rules,
2. Statistical methods.
In this work, we use both categories for automatic detection of stop words in Uzbek texts.

2.1 Based on grammar rules

The sources mainly provide grammatical rules for finding stop words or a list of stop words for different languages [3], [4], [5], [6], [7], [8], [9], [10], [11]. To identify stop words in Uzbek texts, the text is analyzed grammatically. According to the definition of stop words, words in the Uzbek language that are part of a rhyme, conjunctions, introductory words, adverbs, and auxiliary words can be stop words. They need to be separated automatically from the given text. Because syntactic analysis programs are lacking for the Uzbek language, a list of words that are supposed to be stop words is compiled using a dictionary. To create this list from the dictionary, we take the definition of stop words into account. In general, pronouns, adverbs, connectors, and introductory words can be stop words in Uzbek texts.

2.1.1 Pronouns

A pronoun is a part of speech used instead of a noun, adjective, or numeral. The meaning of a pronoun, and the word or words it substitutes, is defined by the context (within or between sentences).
According to meaning and grammatical features, pronouns are divided into generalized-subject pronouns (pronoun-nouns: men (I), sen (you), u (he, she, it), kim (who), nima (what), hechkim (nobody), hechnima (nothing)), generalized-nominal pronouns (pronoun-adjectives: bu (this), shu (this), o’sha (that), qaysi (which), allaqanday (somehow), hechqanday (no)), and generalized-quantitative pronouns (pronoun-numerals: qancha (how much), necha (how many), shuncha (so many), o’shancha (so many)). Pronouns differ from other parts of speech in their polysemy and lack of word formation. By meaning and grammatical features, pronouns are divided into the following types:
– personal pronouns, used instead of persons: men (I), sen (you), u (he, she, it), biz (we), ular (they);
– the proper pronoun, which consists of a proper word denoting an object, strengthening its meaning and emphasizing it;
– indicative pronouns, which indicate an object and its signs: bu (this), shu (this), o’sha (that), u (he, she, it), ana (that), etc.;
– interrogative pronouns, which ask about the subject, attribute and quantity: kim? (who), nima? (what), qancha? (how many);
– definitive-collective pronouns, which indicate the generalization of the subject and its features: hamma (all), bari (all), ba’zan (sometimes), har nima (anything), har qanday (every/any);
– negative pronouns (called indefinite pronouns in some sources), which express the denial of meaning: hechkim (no one), hechqanday (no), hechqanaqa (no), hechqaysi (none).

2.1.2 Adverbs

An adverb is an independent part of speech that denotes a sign of an action or a state, as well as a sign of a sign.
There are the following types of adverb meanings: state adverbs (tez (quick), sekin (slow), piyoda (on foot)); adverbs of place (uzoqda (far), yaqinda (near), pastda (below)); adverbs of time (hozir (now), kecha (yesterday), bugun (today)); adverbs of quantity (ancha (much), sal (little), kam (few)); adverbs of purpose (ataylab (deliberately), jo’rttaga (willingly)); adverbs of reason (e.g., noiloj (helplessly), ilojsiz (helplessly), chorasizlikdan (helplessly)). All forms, except for the forms of time, place and purpose, can by their most general characteristics be attributed to one type and called status forms. An adverb as an independent part of speech is characterized by the following morphological features:
– it has a category of degree: tez (quick), ko’p (much) (oddiy daraja, simple degree); tezroq (quicker), ko’proq (more) (comparative degree); eng tez (the quickest), juda ko’p (much more) (superlative degree);
– it remains unchanged and is often associated with verbs: So’rida qat-qat duxoba ko’rpachalar ustma-ust to’shalgan edi (The couch was covered with layers of velvet mattresses);
– an adverb can also be associated with an adjective or a noun in some places. In such cases, the adverb indicates neither a sign of a sign nor a sign of an object, but refers to the adjective to which it is attached, or to a sign of the action understood from the noun: Kecha havo juda sovuq edi. (It was very cold last night.) U hozir beqiyos va tasavvur qilib bo’lmas baxtiyor edi. (Now, he was incomparably, unimaginably happy.);
– an adverb has suffixes: -cha, -ona, -larcha, -lab, etc.;
– adverbs are formed in morphological and syntactic ways (cases of nominalization are an exception), for each time, including the moment, as in the moment (syntactic method).
By structure, adverbs are divided into simple (kamtarona (modest), vijdonan (conscientious), butunlay (whole)), compound (har dam (always), bir yo’la (together), oz muncha (much), har qachon (always)), paired (kecha-kunduz (day and night), qishin-yozin (winter and summer)) and repeated (oz-oz (little by little), tez-tez (often), ko’p-ko’p (many-many)). Forms such as anchagina (much), juda (very), kam (little), and kam-kam (little by little) are considered modal forms. In a sentence, adverbs act as adverbial modifier, attribute and predicate. Adverbs are similar to adjectives in that both express a feature, but they differ in grammatical properties: an adjective denotes a feature of an object, whereas an adverb denotes a sign of an action or state; their syntactic function in the sentence is also distinct.

2.1.3 Connectors

Connectors are auxiliary words that link organized parts of a sentence, or simple sentences within the structure of a combined sentence. In other words, auxiliary words that connect two or more fragments of a sentence, or two or more sentences, are called connectors.

Introductory words. Introductory words are words that are not syntactically related to the sentence. They express the speaker’s attitude to the expressed thought (“baxtimga” (fortunately), “afsuski” (unfortunately)), the general assessment of the thought (“ehtimol”, “albatta”), to whom it belongs (“menimcha” (to my mind), “aytishlaricha” (it is said)) or its connection with the preceding thought (“xullas” (so), “nihoyat” (finally)). Words used in a sentence in the function of an introductory word, expressing the speaker’s attitude to the expressed thought, are called modal words. Unlike independent words, modal words do not denote a thing, sign, action, etc., and cannot be a part of a sentence.
Therefore, they are not syntactically related to the fragments of the sentence: “Demak, ishlasa bo‘ladi. Ehtimol, ketmon bilan yer ag‘darishga ham to‘g‘ri kelar.” (So, it is possible to work. Perhaps the soil will even have to be turned over with a hoe.)

2.1.4 Usage

The rules described in the previous subsections were used to detect stop words. We applied them to the popular explanatory dictionary of the Uzbek language [3], which contains 80000 words, and detected approximately 1100 stop words. These are by definition one-word stop words, as they come from the dictionary.

2.1.5 Statistical method

Consider the statistical method of automatic detection of stop words in Uzbek texts. In this method, stop words are found based on Term Frequency – Inverse Document Frequency (TF-IDF) [12]. Term Frequency (TF) is the number of times a word occurs in a text. Inverse Document Frequency (IDF) is defined from the number of texts (documents) under consideration and the number of those documents that contain the given word. TF-IDF is one of the popular methods of knowledge discovery. Some words are very common in a text yet almost insignificant in terms of meaning; conversely, some words are rare in a text but very important for its meaning. To increase the impact of meaningful words and decrease the impact of words that add little to the meaning, TF is multiplied by IDF. The statistical method is used as the basis for finding stop words in many languages, and the sources mainly use the TF-IDF method to analyze the words of a text [4], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Several methods to find stop words for Turkish are given in [18].
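The TF-IDF weighting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name tf_idf_weights, the toy documents, and the exact normalization (TF as a word's count divided by the document length, IDF as ln(n/m), and the final weight as the mean of TF times IDF over all documents) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {word: mean TF-IDF weight over all documents}.

    documents is a list of token lists. TF = count / document length and
    IDF = ln(n / m) are assumptions of this sketch, not taken from the paper.
    """
    n = len(documents)
    counts = [Counter(doc) for doc in documents]
    # m for each word: the number of documents that contain it
    doc_freq = Counter(word for c in counts for word in c)
    weights = {}
    for word, m in doc_freq.items():
        idf = math.log(n / m)
        # Counter returns 0 for words missing from a document
        tf_sum = sum(c[word] / len(doc) for c, doc in zip(counts, documents) if doc)
        weights[word] = idf * tf_sum / n
    return weights


# Hypothetical toy documents, already tokenized
docs = [
    "bu kitob juda yaxshi".split(),
    "bu maktab juda katta".split(),
    "bu bahor keldi".split(),
]
weights = tf_idf_weights(docs)
for word in sorted(weights, key=weights.get):
    print(f"{word}: {weights[word]:.4f}")
```

In this toy input, bu occurs in every document, so its IDF is ln(3/3) = 0 and its weight is zero, which is the behaviour that makes such words stop word candidates.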
Compared with the sources cited in the previous section, the scientific novelty of this article is stated at the start of the methodology.

3 Methodology

Scientific novelty of the article. First, a collocation method is proposed for automatically finding stop words in the Uzbek corpus, which consists of 731 156 words, and its advantage is shown by comparison with the bigram method. Second, a stop word detection algorithm is proposed for Uzbek texts.

This section is dedicated to the method of automatic detection of stop words in Uzbek texts. The following procedure was used for detection of stop words.

3.1 Corpus

A corpus named “School corpus” was created using freely available school books such as “Reading book”, “Mother tongue” and “Literature”. The texts were downloaded from Eduportal¹. The total number of documents is 25. The motivation behind the selection of the texts for the corpus was the following:
– everyone enriches their personal vocabulary during the school period,
– the texts are freely available,
– school textbooks are thoroughly checked for errors,
– the selection and length of the documents are sufficiently large (taking into account the low availability of Uzbek texts in digital form in general).

Some basic data about the corpus:
– name: School corpus,
– total number of words: 731 156,
– number of unique words: 47165.

The investigation of stop words in the Uzbek language has shown that, in most stop words that are collocations, the individual words are not stop words when viewed on their own, but they become stop words when considered together as a collocation. Examples 3.1 and 3.2, in which the meaning of the sentences is the same, support this claim. Viewed as individual words, bir and martalik are not stop words, but observed as a collocation they become a stop word.

Example 3.1. Xalqimiz bir martalik shprits vositasida emlanadi. (Our people are vaccinated with a disposable syringe.)
Example 3.2. Xalqimiz shprits vositasida emlanadi. (Our people are vaccinated with syringes.)

Thus, the problem of finding stop words needs to be extended beyond stop words that consist of one word. A collocation consists of 2 or more words. Only two-word collocations are considered in this article, because collocations of three or more words that act as stop words are not that common; we still believe that further work needs to be done in this direction. The proposed methodology does not change for longer collocations.

¹ Eduportal: https://eduportal.uz/Eduportal/Barchasi/33

3.2 Bigram method

For the purpose of this article, the following definition will be used: bigrams are pairs of consecutive words appearing in the text. Let us consider the use of the bigram method for finding stop words in the corpus. Algorithm 1 presents the implementation of the bigram method.

Algorithm 1: The bigram method
1. Consider all occurrences $(a_i, a_{i+1})$ of collocations in the corpus and construct the list of unique pairs $UP_1$. In our corpus, the number of such collocations was 731 155, of which 489857 were unique pairs.
2. For each pair $(a_i, a_{i+1})$ in the list $UP_1$, take $a_i$ and find the next word $a'_{i+1}$ with the highest bigram probability in the corpus. There were 90959 unique pairs $(a_i, a'_{i+1})$, denoted $UP_2$.
3. Calculate the term frequency (TF) of the unique pairs in $UP_2$ for each document in the corpus; in our corpus, for each of the 25 documents. We denote it $D_j TF(a_i, a'_{i+1})$, $j = 1, \dots, 25$.
4. $D_j TF(a_i, a'_{i+1}) = k_j / h_j$, where $h_j$ is the number of occurrences of the pair words in document $j$ and $k_j$ is the number of unique pairs in document $j$.
5. $IDF(a_i, a'_{i+1}) = \ln(n/m)$, where $n = 25$ and $m$ is the number of documents, among the 25, that include the unique pair $(a_i, a'_{i+1})$.
6. $W_{ij}(a_i, a'_{i+1}) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i, a'_{i+1}) \cdot D_j TF(a_i, a'_{i+1})$
7. $W_{ij}(a_i, a'_{i+1})$ denotes the weight of a unique pair.
8. We took the 5 % of the 90957 unique pairs whose weight $W_{ij}(a_i, a'_{i+1})$ is close to zero and declared them stop words.

Algorithm 1 applied to the “School corpus” produced 4548 pairs of words as collocation stop words. A few examples are presented in Figure 1.

1. chop etildi (published)
2. har bir (each)
3. kitob jamgarmasi (book fund)
4. nima uchun (what for)
5. o’rta talim (secondary education)
6. men ham (me too)
7. bilan birga (along with)
8. yaxshi muqova (good cover)
9. oz vaqtida (It's on time)
10. ham bir (also a)
11. bir necha (a few)
12. barcha varaqlari (all sheets)
13. o’zi ham (himself)
14. bu yerda (here)
15. bo’lib qoldi (has become)
16. u ham (he too)
17. uchun ham (for both)
18. uning bu (its this)
19. butun darslikning (of the whole textbook)
20. yangi darslikning (new textbook)
…
4529. Velosiped baxtiga (Luckily for the bike)
4530. Vodiy daralariga (To the gorges of the valley)
4531. Voqealarga aralashadi (Interferes with events)
4532. Xarakteri amallari (Character actions)
4533. Xarakterini izohlang (Explain the character)
4534. xonimning uylariga (to the lady's house)
4535. xoqonning hayoti (the life of a hawk)
4536. xotirasini abadiylashtirish (immortalize the memory)
4537. xudoyor davron (godly era)
4538. xushxabar ammo (The good news, however)
4539. yapon arab (Japanese arab)
4540. yasagan qayiqlarni (made boats)
4541. yasalgan fe’llar (made verbs)
4542. yaxshilar ahbob (good fellow)
4543. yig’isi alomatning (crying symptom)
4544. yig’lagan bolasini (crying baby)
4545. yig’och chog’liq (wood chips)
4546. yodlang islom (Remember Islam)
4547. yo’lakda bir (one in the hallway)
4548. yo’llardan biri (one of the ways)

Figure 1: Examples selected from the list of all stopwords generated by the bigram Algorithm 1.

3.3 The collocation method’s algorithm

The following definition of collocation will be used throughout the article: an occurrence of consecutive words in a corpus. A two-word collocation and a bigram represent essentially the same starting set of word pairs, but bigrams are limited to the most probable pair, whereas collocations take all pairs into consideration.

A collocation may consist of 2 or more words. The presented method and the derived results are limited to two-word collocations for the sake of simplicity, but the method can be generalized to any length. The method should be used before the single stop word detection method.

To find collocation stop words, we use Algorithm 2. Algorithm 2 applied to the “School corpus” produced 24490 pairs of words as collocation stop words. A few examples are presented in Figure 2.

3.4 Single word (stop word) detection algorithm

In this section we consider the algorithm for detecting single-word stop words, based on the TF-IDF (term frequency and inverse document frequency) of the word. To find single-word stop words, we use Algorithm 3. Algorithm 3 applied to the “School corpus” produced 2358 stop words. A few examples are presented in Figure 4.

3.5 The final stop word detection Algorithm for Uzbek language

In this section we consider the main algorithm for detecting stop words in Uzbek texts. The algorithm follows the scheme presented in Figure 3.

Algorithm 2: The collocation method
1. Consider all occurrences of collocations in the corpus. In our case the total number of such collocations was 731 155, of which 489857 were unique collocations.
2. $D_j TF(a_i, a_{i+1}) = k_j / h_j$, where $h_j$ is the number of occurrences of the pair words in document $j$ and $k_j$ is the number of unique pairs in document $j$.
3. $IDF(a_i, a_{i+1}) = \ln(n/m)$, where $n = 25$ and $m$ is the number of documents, among the 25, that include the unique pair.
4. $W_{ij}(a_i, a_{i+1}) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i, a_{i+1}) \cdot D_j TF(a_i, a_{i+1})$
5. $W_{ij}(a_i, a_{i+1})$ denotes the weight of the collocation $(a_i, a_{i+1})$.
6. The 5 % of all unique collocations whose weight $W_{ij}(a_i, a_{i+1})$ was close to zero were declared stop words.

1. har bir (each)
2. nima uchun (what for)
3. bir kuni (one day)
4. o’rta talim (secondary education)
5. uchun darslik (textbook for)
6. chop etildi (published)
7. kitob jamg’armasi (book fund)
8. abad ham (forever)
9. abadiy kuchidan (from eternal power)
10. abadiy manziliga (to the eternal address)
11. abadiy muhrlanib (sealed forever)
12. abadligi hamda (eternity and)
13. Abadul abad badnom (Abadul abad badnom)
14. Abadul abad turajakdur (It will last forever)
15. Abay singari (Abay suchlike)
16. Abbos degan (Abbos named)
17. Abbos qilichi (The sword of Abbas)
18. Abdulaziz qaytib (Abdulaziz returned)
19. Abdulazizga qaradi (He looked at Abdulaziz)
…
24471. Odamlarni ko’rishadi (They see people)
24472. Odamlarning chehralari (Faces of people)
24473. Odamlarning haqiga (About people)
24474. Odamlarning kamligi (Lack of people)
24475. Odamlarning ko’zidan (From people's eyes)
24476. Odamlarning ko’zini (People's eyes)
24477. Odamlarning nomlarini (The names of the people)
24478. odamlarning og’irini (the weight of people)
24479. odamlarning qaysi (which of the people)
24480. odamlarning va (people and)
24481. odamlarning zilzila (earthquake of people)
24482. odamligi uni (humanity him)
24483. odamligini ham (that he is human)
24484. odamligini ta’minlab (providing humanity)
24485. odamlik qiyofasini (human image)
24486. odamman deb (that I am human)
24487. odamman deganini (I mean man)
24488. odamma saxir (I'm sorry)
24489. odamni ajdodlari (man's ancestors)
24490. odamni ona (mother of man)

Figure 2: Examples selected from the list of all stopwords generated by the collocation Algorithm 2.
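A minimal Python sketch of the collocation weighting in Algorithm 2 follows. It is an illustration under stated assumptions rather than the authors' code: the name collocation_stop_pairs, the fraction parameter, and the toy sentences are hypothetical, and TF is taken here as the count of a pair divided by the total number of pairs in the document, which is the common definition and may differ from the $k_j / h_j$ ratio given above.

```python
import math
from collections import Counter


def collocation_stop_pairs(documents, fraction=0.05):
    """Rank adjacent word pairs by mean TF-IDF; return the lowest-weighted share.

    documents is a list of token lists. The TF normalization and the
    rounding of the cut-off are assumptions of this sketch.
    """
    n = len(documents)
    # Count every pair of consecutive tokens in each document
    pair_counts = [Counter(zip(tokens, tokens[1:])) for tokens in documents]
    totals = [sum(c.values()) for c in pair_counts]
    # m for each pair: the number of documents that contain it
    doc_freq = Counter(pair for c in pair_counts for pair in c)
    weights = {}
    for pair, m in doc_freq.items():
        idf = math.log(n / m)
        tf_sum = sum(c[pair] / t for c, t in zip(pair_counts, totals) if t)
        weights[pair] = idf * tf_sum / n
    # Lowest weights first; ties are broken alphabetically for determinism
    ranked = sorted(weights, key=lambda p: (weights[p], p))
    cutoff = max(1, round(fraction * len(ranked)))
    return ranked[:cutoff], weights


# Hypothetical toy corpus of three tokenized documents
docs = [
    "har bir bola kitob oladi".split(),
    "har bir maktab katta".split(),
    "har bir kun yangi".split(),
]
stop_pairs, weights = collocation_stop_pairs(docs)
print(stop_pairs)
```

The bigram method of Algorithm 1 would differ only in an extra filtering step that keeps, for each first word, the single most probable following word before the weights are computed.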
Algorithm 3: Single word (stop word) detection algorithm
1. $D_j TF(a_i) = k_j / h_j$, where $h_j$ is the number of occurrences of the word in document $j$ and $k_j$ is the number of unique words in document $j$.
2. $IDF(a_i) = \ln(n/m)$, where $n = 25$ and $m$ is the number of documents, among the 25, that include the word.
3. $W_{ij}(a_i) = \frac{1}{25} \sum_{j=1}^{25} IDF(a_i) \cdot D_j TF(a_i)$
4. $W_{ij}(a_i)$ denotes the weight of the word $a_i$.
5. The 5 % of the 47165 unique words whose weight $W_{ij}(a_i)$ was close to zero were declared stop words.

Algorithm 4: find and remove Uzbek stop words from text (Corpus)
Input(Corpus)
Corpus ← Tokenize(Corpus)
Dictionary ← Extract_From_Dictionary(pronoun, modal verb, particle, part of a rhyme, conjunctions, introductory words, adverbs, auxiliary words)
/* Procedure Check(Corpus) */
i ← 1
while i < len(Corpus) do
    if Corpus(i) ∈ Dictionary then
        Corpus ← Corpus − Corpus(i)
    i ← i + 1
/* Procedure Collocation_Two_Words(Corpus) */
Corpus ← Tokenize(Corpus)
i ← 1
while i < len(Corpus) do
    S(i) ← token(i) + token(i+1)
    i ← i + 1
/* Procedure IDF() */
IDF(S(i)) ← ln(N/n)    // N: number of all documents; n: number of documents that include S(i)
/* Procedure TFIDF() */
j ← 1
while j < len(Corpus) do
    TF(j) ← 0
    i ← j
    while i < len(Corpus) − 1 do
        if S(j) == S(i) then
            TF(j) ← TF(j) + 1
        i ← i + 1
    TFIDF(j) ← TF(j) * IDF(S(j))
    if TFIDF(j) close to zero then
        Dictionary(j) ← S(j)
        i ← 1
        while i < len(Corpus) do
            if Dictionary(j) == Corpus(i) then
                Corpus ← Corpus − Dictionary(j)
            i ← i + 1
    j ← j + 1

Figure 3: Scheme of the whole process.

1. Abdulla (Abdulla)
2. aka (brother)
3. asosida (based on)
4. ayt (say)
5. aytib (telling)
6. aziz (dear)
7. baho (evaluation)
8. bahor (spring)
9. baland (high)
10. beradi (will give)
11. berdi (gave)
12. berib (giving)
13. berilgan (given)
14. bering (read)
15. bichimi (physique)
…
2344. badiiyatni (art)
2345. bag’ayri (past)
2346. bag’rimdami (in my heart)
2347. baid (height)
2348. balladalar (ballads)
2349. banddin (occupied)
2350. Bandksoy (Bandkushoy)
2351. barchalarining (all of them)
2352. barglarga (to the leaves)
2353. bastai (composer)
2354. baxilga (stingy)
2355. baxtdan (happily)
2356. baytallarga (beetles)
2357. bazmni (party)
2358. begonani (outsider)

Figure 4: Examples selected from the list of all stopwords generated by the single word extraction Algorithm 3.

Table 1: Number of stop words created by each presented algorithm.
Algorithm      Number of stop words
Bigram         4548
Collocation    24490
Single word    2358

4 Results

The first phase of the project consisted of creating a solid base for corpus linguistics, as there were no readily available corpora for the Uzbek language. A corpus named “School corpus” was created with 731 156 running words. The algorithms for stop word detection were applied to this corpus; the number of stop words produced by each algorithm is given in Table 1.

5 Data availability

The presented automatically extracted lists (one list for each described method) are freely available in the Zenodo repository [23]: https://doi.org/10.5281/zenodo.6319953

6 Conclusion and further work

The article presents the first attempt at the automatic detection of stop words for the Uzbek language. A corpus named “School corpus” was created for this purpose; it contains 25 documents and 731 156 running words, of which 47165 are unique words. Three methods were applied to the corpus in order to extract (or detect) stop words: a method that extracts single-word stop words and two methods that target pairs of words, a bigram method and a collocation method. Each method is described and presented in the form of an algorithm. The methods can be used in series and the results can be added together to form the final list of stop words. Since the notion of a stop word depends on the text, under this TF-IDF-based approach any word can be a stop word.

Figure 5: Number of stop words for each algorithm applied to the “School corpus”.
A quick comparison of the methods shows an increase in stop word detection using the collocation method.

This research is believed to support other works in Uzbek, not only in the field of automatic stopword detection, but also in other related NLP areas [24], such as Uzbek WordNet [25], opinion mining [26], or semantic analysis [27].

References

[1] P. Tomašič, G. Papa, and M. Žnidaršič, “Using a genetic algorithm to produce slogans,” Informatica, vol. 39, no. 2, 2015.
[2] R. Yayla and T. T. Bilgin, “Determining of the user attitudes on mobile security programs with machine learning methods,” Informatica (Slovenia), 2021. [Online]. Available: https://doi.org/10.31449/inf.v45i3.3506
[3] S. Matlatipov, X. Madatov, G. Matlatipov, A. O‘razbayev, M. Raximboyev, I. Avezmatov, U. Babajanov, L. Kurbanova, D. Xujamov, and D. Matjumayeva, “‘o‘zbek tilining statistik electron lug‘at’ exm dasturi uchun guvohnoma,” Intellektual mulk agentligi, 2020.
[4] A. W. Pradana and M. Hayaty, “The effect of stemming and removal of stop words on the accuracy of sentiment analysis on indonesian-language texts,” Game Technology, Information System, Computer Network, Computing, Electronics, and Control Journal, vol. 4, no. 3, pp. 277–288, 2019. [Online]. Available: https://doi.org/10.22219/kinetik.v4i4.912
[5] R. U. Haque, P. Mehera, M. F. Mridha, and M. A. Hamid, “A complete bengali stop word detection mechanism,” in Conference Paper ∙ May 2019. Conference, 2019. [Online]. Available: https://doi.org/10.1109/ICIEV.2019.8858544
[6] R. Rania and D. K. Lobiyal, “Automatic construction of generic stop words list for hindi text,” in International Conference on Computational Intelligence and Data Science, vol. 132, International Conference on Computational Intelligence and Data Science. ICCIDS 2018, 2018, pp. 362–370.
[7] P. J. Burns, “Constructing stoplists for historical languages,” Digital Classics Online, vol. 4, no. 2, 2018. [Online]. Available: https://doi.org/10.11588/dco.2018.2.52124
[8] R. M. Rakholia and J. R. Saini, “A rule-based approach to identify stop words for gujarati language,” in Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, 2017, pp. 797–806.
[9] J. K. Raulji and J. R. Saini, “Generating stopword list for sanskrit language,” in 2017 IEEE 7th International Advance Computing Conference. IEEE 7th, 2017, pp. 799–802.
[10] O. D. Tijani, A. T. Akinwale, S. A. Onashoga, and E. O. Adeleke, “An auto-generated approach of stop words using aggregated analysis,” in Proceedings of the 13th International Conference of the Nigeria Computer Society, 2017, pp. 99–115.
[11] M. Mhatre, D. Phondekar, P. Kadam, A. Chawathe, and K. Ghag, “Dimensionality reduction for sentiment analysis using pre-processing techniques,” in Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication. ICCMC, 2017, pp. 16–21. [Online]. Available: https://doi.org/10.1109/ICCMC.2017.8282676
[12] C. Sammut and G. I. Webb, Eds., TF–IDF. Boston, MA: Springer US, 2010, pp. 986–987. [Online]. Available: https://doi.org/10.1007/978-0-387-30164-8_832
[13] Y. Wang, K. Kim, B. Lee, and H. Y. Youn, “Word clustering based on pos feature for efficient twitter sentiment analysis,” Human-centric Comput, vol. 8, no. 17, pp. 1–25, 2019. [Online]. Available: https://doi.org/10.1186/s13673-018-0140-y
[14] N. Ousirimaneechai and S. Sinthupinyo, “Extraction of trend keywords and stop words from thai facebook pages using character n-grams,” International Journal of Machine Learning and Computing, vol. 8, no. 6, 2018.
[15] C. Slamet, A. R. Atmadja, D. S. Maylawati, R. S. Lestari, W. Dharmalaksana, and M. A. Ramdhani, “Automated text summarization for indonesian article using vector space model,” in IOP Conf. Ser. Mater. Sci. Eng., vol. 288, no. 1, Conference. IOP, 2018. [Online]. Available: https://doi.org/10.1088/1757-899X/288/1/012037
[16] G. Li and J. Li, “Research on sentiment classification for tang poetry based on tf-idf and fp-growth,” in Proceedings of 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference. IAEAC, 2018, pp. 630–634. [Online]. Available: https://doi.org/10.1109/IAEAC.2018.8577715
[17] H. M. Zin, N. Mustapha, M. A. A. Murad, and N. M. Sharef, “The effects of pre-processing strategies in sentiment analysis of online movie reviews,” in AIP Conf. Proc., vol. 1891, no. 1. AIP Conf., 2017, pp. 1–7. [Online]. Available: https://doi.org/10.1063/1.5005422
[18] S. K. Metin and B. Karaog’lan, “Stop word detection as a binary classification problem,” Anadolu University Journal of Science and Technology A-Applied Sciences and Engineering, vol. 18, no. 2, pp. 346–359, 2017. [Online]. Available: https://doi.org/10.18038/aubtda.322136
[19] J. K. Raulji and J. R. Saini, “Generating stop word list for sanskrit language,” in Advance Computing Conference IEEE 7th International. IEEE, 2017, pp. 799–802.
[20] S. J. R. Rakholia R. M., “A rule-based approach to identify stop words for gujarati language,” in Suresh Chandra Satapathy Vikrant Bhateja Siba K., 2017.
[21] R. M. Rakholia and J. R. Saini, “Information retrieval for gujarati language using cosine similarity based vector space model,” in Theory and Applications. Springer Singapore, 2017, pp. 1–9.
[22] X. Madatov and S. Matlatipov, “Kosinus o’xshahshlik va uning o’zbek tili matnlariga tatbiqi haqida,” O’zMU xabarlari, vol. 2, no. 1, 2016.
[23] K. Madatov, S. Bekchanov, and J. Vičič, “Lists of uzbek stopwords (1.1) [data set],” Zenodo. [Online]. Available: https://doi.org/10.5281/zenodo.6319953
[24] K. Madatov, S. Bekchanov, and J. Vičič, “Dataset of stopwords extracted from uzbek texts,” Data in Brief, vol. 43, p. 108351, 2022.
[25] K. A. Madatov, D. Khujamov, and B. Boltayev, “Creating of the uzbek wordnet based on turkish wordnet,” in AIP Conference Proceedings, vol. 2432, no. 1. AIP Publishing LLC, 2022, p. 060009.
[26] S. Matlatipov, H. Rahimboeva, J. Rajabov, and E. Kuriyozov, “Uzbek sentiment analysis based on local restaurant reviews,” arXiv preprint arXiv:2205.15930, 2022.
[27] U. Salaev, E. Kuriyozov, and C. Gómez-Rodríguez, “Simreluz: Similarity and relatedness scores as a semantic evaluation dataset for uzbek language,” arXiv preprint arXiv:2205.06072, 2022.