https://doi.org/10.31449/inf.v46i3.3970                                                                                      Informatica 46 (2022) 373-382     373 
Automatic Classification of Document Resources Based on Naive 
Bayesian Classification Algorithm 
 
Rong Wang 1*
 
Email: rongwang9@126.com 
 
Keywords: Literature Resources; Naive Bayes Data Discretization; Automatic Classification; Ontology Integration 
Module 
 
Received: February 3, 2022 
 
The World Wide Web has grown enormously as the number of documents it holds increases rapidly. This
paper details the automatic classification of document resources based on the naive Bayesian
classification algorithm. First, it introduces the relevant theory of naive Bayes classification and of
automatic document classification systems. Then, an automatic classification system for massive network
academic documents is designed and implemented. The system has a modular design comprising an
automatic academic document capture module, a word-document matrix processing module, an ontology
integration module and a semantics-driven classification module. Finally, the naive Bayesian
classification algorithm is trained on a training set covering the 12 categories preset in the
professional classification directory of the Ministry of Education. Experiments show that the naive
Bayesian classification algorithm can effectively complete the automatic capture, processing and
classification of massive academic documents, improving classification accuracy while reducing the
running time of automatic classification. It addresses the integration of two heterogeneous ontology
libraries and the problem that the traditional word vector space cannot meet people's needs for
semantic classification.
 
  Povzetek (summary): A naive Bayes algorithm is implemented for the automatic classification of documents from the web.
 
1 Introduction
The Internet is an information resource of text, images, audio and video. The amount of
information available on the World Wide Web (WWW) is increasing at an exponential rate.
Web documents contain rich textual information, but the growth of the Internet has made it
difficult for users to locate relevant information quickly. At present, network academic
resources are growing in both breadth and depth and have attracted more and more attention
from the academic community. The massive body of network academic literature is huge in
scale and updated quickly, and fully mining it has important academic value. However, these
same characteristics are a stumbling block for researchers who want to make use of it.
Acquiring and processing a large amount of academic literature is a severe test of computer
processing power and throughput. Whether in terms of processing speed, storage space, fault
tolerance or access speed, a single-computer architecture can hardly complete this task.
Because the huge number of network academic documents makes them difficult to use
effectively, it is of practical significance to classify them automatically by discipline.
Automatic document classification is widely used in information retrieval, data mining,
spam filtering, digital libraries and so on. There are two common classification methods:
one is rule-based, which usually requires a large number of domain experts to extract rules
from the text; this is time-consuming and laborious, and the classification effect is poor.
The other is statistical machine learning, including the nearest neighbor method, support
vector machines, naive Bayes, decision trees, neural networks, etc. Methods of this kind
usually train the document classification model on a feature vector space. However, word
feature vectors ignore the semantic relationships between words and cannot reflect
synonymy, polysemy or the hypernym-hyponym relationships between words, which results in a
vector space of excessively high dimension. When massive document collections are
classified automatically, problems such as insufficient memory, slow classification speed
and low classification performance arise, so automatic document classification techniques
cannot be applied more widely in specific fields [2]. To solve the problems of traditional
automatic document classification based on the word vector space, a series of
semantics-driven automatic document classification methods have been proposed, such as
latent semantic analysis, ontology semantic mapping, concept lattice construction,
standardized concept analysis and so on. Although semantics-driven methods can greatly
reduce the dimension of the document vector space, they also have many defects, such as
high requirements for semantic reasoning ability, high computational complexity, and an
inability to classify web documents quickly and effectively.
Bayesian classification (Figure 1) rests on the solid theoretical basis of Bayes' theorem.
For a given sample, the posterior probability of belonging to each category is calculated
from the distribution of each category's samples in the training set, and the sample is
then assigned to the category with the maximum posterior probability. The principle of this
method is simple, but when the number of attributes is large, training a classification
model strictly according to Bayes' theorem incurs a huge computational overhead, which
greatly limits its practical application [3]. Scholars therefore introduced the simplifying
assumption of attribute conditional independence and proposed the practical naive Bayesian
classification algorithm, which greatly reduces the computational overhead of model
training. Research also shows that the naive Bayesian classification method still performs
well in many practical applications.
 
 
Figure 1: Bayesian classification 
 
Contribution: This paper introduces the relevant theory of naive Bayes classification and
of automatic document classification systems. It then designs and implements an automatic
classification system for massive network academic documents.
The paper is organized as follows. Section 2 surveys the related literature. Section 3
presents the automatic classification of massive network academic documents. Section 4
gives the experimental analysis. Finally, Section 5 concludes the paper.
2 Literature review 
Du, J. H. and others proposed a network extended naive Bayesian classification model
(BAN). This method extends the structure of the naive Bayesian classifier further. Like
TAN, it is an improved model whose fundamental starting point is to weaken the assumption
that attributes are independent. The BAN model is the same as the TAN model in many
respects: it also stipulates that the class node is the root and that all other attribute
nodes take the class node as a parent. The only difference is that the BAN classifier uses
a Bayesian network as its expression structure [4]. Y. Kumar's Bayesian augmented naive
Bayesian classifier, GBAN, is based on a genetic algorithm. The GBAN model satisfies the
restriction that the network extended naive Bayesian classification model places on the
network structure, namely that any attribute node has at most m parent nodes (generally
m < 4), not counting the category variable [5]. The hybrid tree augmented naive Bayesian
classification model proposed by Dias, K. L. is based on rough set theory. The model is
constructed as follows: based on the attribute reduction theory of rough sets, and under
the condition that the classification ability remains unchanged, the attribute variables
are divided into two categories according to their impact on the classification results.
Attribute variables that have little or no impact on the classification results are
assumed to be independent of each other, and these nodes can have only one parent node.
Attribute variables that affect the classification results are not independent of each
other, and these nodes can have two parent nodes [6].
Tajanpure proposed a hybrid model (NBTree) combining decision trees and naive Bayes. The
algorithm learns an NBTree in much the same way as a decision tree (C4.5), but it differs
in the score function used to evaluate attribute splits [7]. Gaber, A. and others proposed
an averaged naive Bayesian tree model [8]. Lopes, F. and others proposed an improved naive
Bayes model (LBR) that combines lazy learning with naive Bayes; it achieves high
classification accuracy, but its classification efficiency is not very high [9]. In terms
of automatic document classification, the coverage-coefficient-based method of An, Y. and
others classifies documents according to the inherent attributes of the document set. The
method uses mathematical tools to derive a rigorously reasoned classification procedure.
Its premise is that (under certain general assumptions) the class of each document and the
number of classes in the document set are determined by the inherent attributes of the
document set itself [10]. Rueda and others proposed a model for the automatic acquisition
and parallel processing of massive network academic documents. Rules specified on the
Heritrix platform are used to capture the data of the seed sites. The captured files are
judged against a set of academic literature feature rules, some of them are then selected
and indexed by category by invited domain experts, the machine learning classification
algorithm is trained on them, and finally all documents are classified [11]. Many
researchers have worked in this field in previous years; some of the relevant articles are
listed in Table 1.
 
 
 
| Authors | Presented work | Key points | Benefits | References |
|---|---|---|---|---|
| Mohamed El Kourdi et al. | “Naive Bayes (NB) is a statistical machine learning algorithm utilized for the classification of non-vocalized Arabic web documents which is presented in this paper.” | “The data set utilized during the experiments consists of 300 web documents per category.” | High classification accuracy | [12] |
| Huaixin Chen et al., 2018 | “Improved Naïve Bayes classifiers are presented utilizing multinomial model.” | “The proposed method is able to improve the accuracy of Naïve Bayes classifiers dramatically.” | Good scalability | [13] |
| Yong Wang et al., 2003 | “An automatic document classification system, WebDoc, which classifies Web documents according to the Library of congress is presented.” | “Performance of each method in terms of recall, precision, and F-measures is reported.” | Highly effective and efficient | [14] |
| A. B. Adetunji et al., 2018 | “A University web site is used as a case study and a machine learning workbench called WEKA is discussed.” | “General-purpose environment for automatic classification, clustering and feature selection are provided.” | Ability of the Naïve Bayes algorithm to accurately classify a vast amount of web documents | [15] |
| Yugang Dai et al., 2014 | “Naïve bayesian classification algorithm is presented by the author which is further combining with the rough set theory.” | “This algorithm is implemented on a cloud platform utilizing map-reduce programming mode.” | High recall rate | [16] |

Table 1: Some existing and relevant articles in previous years
 
 
 
 
 
 
 
 
 
 
3 Automatic classification of massive network academic documents
The system aims to acquire massive document collections automatically and to classify
them automatically. Its framework is shown in Figure 2:
 
 
Figure 2: Framework of automatic classification system 
for massive network academic documents 
 
The automatic document acquisition module first captures documents from the Internet and
identifies academic documents according to predetermined rules and conditions, so as to
filter out irrelevant documents. Then the matrix processing module transforms the academic
literature into a word-document matrix for subsequent processing. Finally, after training
and ontology integration, the word-document matrix is imported into the automatic
classification module to obtain the classification results [17, 18].
(1) Automatic acquisition of massive network academic documents
The automatic classification system first has to obtain massive numbers of academic
documents. Heritrix is used to grab all PDF files under the domain name of a specific
website, checkpdf is used to read the PDF files, and academic literature is identified by
a rule-based judgment method, as shown in Figure 3:
 
 
Figure 3: Automatic acquisition of massive network 
academic documents 
 
To select a capture tool, the author analyzed network resource capture platforms such as
Nutch, Heritrix, JSpider and Web-Harvest in terms of capture efficiency and scalability,
and finally selected Heritrix as the capture platform. Heritrix is highly scalable
[19, 20], retains the original file structure and directories, and has a web user
interface. It runs on Linux and can sustain a high capture speed. As for file format,
considering the convenience of subsequent processing and the proportion of the various
file types, PDF is selected as the main file type to capture. After the PDF files are
captured, they need to be screened so that only academic literature is retained. A
rule-based decision method is used, that is, the decision is made through keywords.
Analysis of a large number of academic documents shows that their characteristic words
include abstract, keywords, introduction, discussion, conclusion and recognition. A given
document may contain only some of these words, so a threshold is set and the decision is
made according to how many of the above words the document contains [21-22].
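The keyword rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names and the threshold value of 3 are assumptions, and the text of each PDF is assumed to have been extracted already.

```python
import re

# Characteristic words listed in the text above.
FEATURE_WORDS = ("abstract", "keywords", "introduction",
                 "discussion", "conclusion", "recognition")


def count_feature_words(text: str) -> int:
    """Count how many distinct characteristic words occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for word in FEATURE_WORDS if word in tokens)


def is_academic(text: str, threshold: int = 3) -> bool:
    """Rule-based decision: keep the document if enough feature words occur.

    The default threshold of 3 is an illustrative assumption, not a value
    taken from the paper.
    """
    return count_feature_words(text) >= threshold


paper = "Abstract ... Keywords: naive Bayes ... 1 Introduction ... 5 Conclusion"
flyer = "Campus map and parking information for visitors"
print(is_academic(paper))  # True
print(is_academic(flyer))  # False
```

Raising the threshold makes the filter stricter (fewer non-academic files slip through) at the cost of rejecting short papers that lack some of the sections.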
(2) Word-document matrix processing of massive network academic literature
Because of the large number of documents to be processed, the word frequency matrix is
generated by distributed processing. This part is implemented with Hadoop and consists of
a Hadoop namenode and Hadoop datanodes. The namenode is responsible for scheduling the
parallel processing, and the datanodes carry out the actual parallel processing. Academic
documents are first read into the Hadoop platform, and an index of all documents is saved
on the namenode. The documents themselves are stored redundantly on at least two
datanodes. Finally, the namenode calls the parallel processing program to generate the
word-document matrix of the academic literature [23-25], as shown in Figure 4:
 
 
Figure 4: Massive network academic literature words - 
document matrix processing 
 
In the map phase of Hadoop, StringTokenizer is used to extract the words of each document
in turn and to generate a key/value pair <word, document ID> for each of them. In the
reduce phase, one reducer processes all pairs for the same word: it creates an array whose
length is the number of documents, stores the frequency of the current word in the
corresponding document positions, and updates the array as it receives the key/value
pairs in turn. The matrix is output after all reducers have finished. Since this matrix is
sparse, the zero entries can be dropped and the matrix output in sparse form to reduce
storage space [26, 27].
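The map and reduce steps described above can be imitated in plain Python on a single machine. The sketch below is only an illustration of the data flow, not Hadoop code: the function names are hypothetical, whitespace splitting stands in for StringTokenizer, and the grouping loop stands in for Hadoop's shuffle step.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (word, document ID) pair per token."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_phase(word, doc_ids, num_docs):
    """Reducer: build the frequency array of one word over all documents."""
    row = [0] * num_docs
    for doc_id in doc_ids:
        row[doc_id] += 1
    return word, row


def to_sparse(row):
    """Drop zero entries, keeping {document ID: frequency}."""
    return {doc_id: freq for doc_id, freq in enumerate(row) if freq}


def word_document_matrix(docs):
    # Shuffle step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word, d in map_phase(doc_id, text):
            grouped[word].append(d)
    matrix = {}
    for word in sorted(grouped):
        _, row = reduce_phase(word, grouped[word], len(docs))
        matrix[word] = to_sparse(row)
    return matrix


docs = ["bayes classifier bayes", "ontology classifier"]
matrix = word_document_matrix(docs)
print(matrix)
# {'bayes': {0: 2}, 'classifier': {0: 1, 1: 1}, 'ontology': {1: 1}}
```

Each reducer output is the frequency vector of one word across documents; transposing the result gives one row per document.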
(3) Ontology integration
A common way to understand natural language is to use an ontology library to annotate
and integrate text. This part mainly uses PROMPT. PROMPT first reads the ontologies, then
analyzes the relationships between concepts, maps identical concepts onto each other,
retains the concepts that are specific to one ontology library, and finally generates an
integrated ontology, as shown in Figure 5.
 
 
Figure 5: Ontology integration 
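
The integration steps (read both ontologies, map identical concepts, keep the concepts unique to either library) can be illustrated with a toy merge. This is a hedged sketch and not the PROMPT algorithm: the dictionary representation, the `normalize` matching rule and all names are assumptions made for illustration.

```python
def normalize(name: str) -> str:
    """Hypothetical matching rule: two concepts are the same if their names
    match after trimming and lower-casing. Real tools use richer heuristics."""
    return name.strip().lower()


def merge_ontologies(onto_a, onto_b):
    """Merge two {concept: set of parent concepts} dictionaries."""
    merged = {}
    for onto in (onto_a, onto_b):
        for concept, parents in onto.items():
            key = normalize(concept)
            # Identical concepts share one entry; unique concepts get their own.
            merged.setdefault(key, set()).update(normalize(p) for p in parents)
    return merged


a = {"Algorithm": {"Method"}, "Naive Bayes": {"Algorithm"}}
b = {"algorithm": {"Technique"}, "Decision Tree": {"Algorithm"}}
merged = merge_ontologies(a, b)
print(sorted(merged))
# ['algorithm', 'decision tree', 'naive bayes']
print(sorted(merged["algorithm"]))
# ['method', 'technique']
```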
 
3.1 Naive Bayesian algorithm 
Before describing the naive Bayesian classification algorithm, the classification problem
is formalized from the perspective of statistics. Let $X$ represent the attribute set of
the system data set, $X = \{A_1, A_2, \ldots, A_m\}$, and let $Y$ represent the class label
set of the system data set, $Y = \{C_1, C_2, \ldots, C_t\}$. Because the relationship
between class variables and attributes is uncertain, $X$ and $Y$ can be regarded as random
variables, and $P(Y \mid X)$ can be used to capture the relationship between them in a
probabilistic manner. $P(Y \mid X)$ is also called the posterior probability of class $Y$.
Correspondingly, $P(Y)$ is called the prior probability of class $Y$ [28, 29].
In the training stage of the naive Bayesian classification algorithm, statistics are first
collected from the training data set, and the posterior probability $P(Y \mid X)$ of each
combination of attribute set $X$ and class $Y$ is calculated. Once these probabilities are
known, a test sample $X'$ can be classified by finding the class $Y'$ that maximizes the
posterior probability $P(Y' \mid X')$. However, it is very difficult to estimate
accurately the posterior probability of every possible combination of class $Y$ and
attribute values, because a large training set is required even when the number of
attributes is small. Bayes' theorem plays an important role here, because it expresses the
posterior probability in terms of the prior probability $P(Y)$, the class conditional
probability $P(X \mid Y)$ and the evidence $P(X)$. The posterior probability $P(Y \mid X)$
is calculated by Bayes' theorem as in formula (1).

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} \qquad (1)$$
When the posterior probabilities of different values of $Y$ are compared, the denominator
$P(X)$ is always constant and can be ignored. The prior probability $P(Y)$ is easily
estimated as the proportion of training samples that belong to each class. However, for
training data with $m$ attributes [30, 31], calculating the class conditional probability
$P(X \mid Y)$ is time-consuming. To make the calculation of $P(X \mid Y)$ more efficient,
the naive Bayesian classification algorithm assumes that the attributes are conditionally
independent given the class when estimating the class conditional probability. The
assumption of attribute conditional independence is expressed by formula (2), where $X_i$
is the value of the $i$-th attribute:

$$P(X \mid Y = y) = \prod_{i=1}^{m} P(X_i \mid Y = y) \qquad (2)$$
With the conditional independence assumption, it is no longer necessary to estimate the
class conditional probability of every combination of values of $X$. Instead, given a
class label $Y$, only the conditional probability of each $X_i$ has to be estimated. The
latter approach is more practical, because under the assumption of conditional
independence good probability estimates can be obtained without a large training data set
[32-34].
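A small count makes the saving concrete. The sketch below compares, for one class, how many probabilities must be estimated with and without the independence assumption; it assumes every attribute takes the same number of discrete values, and the function names are illustrative.

```python
def full_joint_parameters(num_attributes: int, values_per_attribute: int) -> int:
    """Probabilities needed per class to model P(X | Y) without independence:
    one per combination of attribute values, minus one because they sum to 1."""
    return values_per_attribute ** num_attributes - 1


def naive_bayes_parameters(num_attributes: int, values_per_attribute: int) -> int:
    """Probabilities needed per class under conditional independence:
    (values - 1) free parameters for each attribute."""
    return num_attributes * (values_per_attribute - 1)


for m in (5, 10, 20):
    print(m, full_joint_parameters(m, 2), naive_bayes_parameters(m, 2))
# 5 31 5
# 10 1023 10
# 20 1048575 20
```

With 20 binary attributes the full model needs over a million estimates per class, while the naive model needs 20, which is why far fewer training samples suffice.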
In the classification test stage, the naive Bayesian classification algorithm calculates
the posterior probability of each class for a test sample $X$, as shown in formula (3):

$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{m} P(X_i \mid Y)}{P(X)} \qquad (3)$$
The estimates of $P(Y)$ are fixed once the training data set is fixed, and $P(X)$ is the
same for every class once the test sample is given. It is therefore sufficient to find the
class that maximizes the numerator $P(Y) \prod_{i=1}^{m} P(X_i \mid Y)$. The biggest
disadvantage of the naive Bayesian classification algorithm in this form is that it can
only deal with discrete attributes [35, 36].
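Formulas (1) to (3) translate directly into a small classifier for discrete attributes. The following is a minimal sketch under stated assumptions, not the system's implementation: probabilities are plain relative frequencies with no smoothing, and the class name, attribute values and toy data are invented for illustration.

```python
from collections import Counter, defaultdict


class NaiveBayes:
    """Naive Bayes for discrete attributes, following formulas (1)-(3)."""

    def fit(self, samples, labels):
        self.total = len(labels)
        self.class_counts = Counter(labels)
        # value_counts[(class, attribute index)][value] = frequency
        self.value_counts = defaultdict(Counter)
        for x, y in zip(samples, labels):
            for i, value in enumerate(x):
                self.value_counts[(y, i)][value] += 1
        return self

    def score(self, x, y):
        """Numerator of formula (3): P(Y) * prod_i P(X_i | Y)."""
        result = self.class_counts[y] / self.total
        for i, value in enumerate(x):
            result *= self.value_counts[(y, i)][value] / self.class_counts[y]
        return result

    def predict(self, x):
        # P(X) is the same for every class, so it is left out.
        return max(self.class_counts, key=lambda y: self.score(x, y))


# Toy data: (contains the word "abstract", document length) -> class
samples = [("yes", "long"), ("yes", "short"), ("no", "long"), ("no", "short")]
labels = ["paper", "paper", "paper", "other"]
model = NaiveBayes().fit(samples, labels)
print(model.predict(("yes", "long")))   # paper
print(model.predict(("no", "short")))   # other
```

Because no smoothing is applied, an attribute value never seen with a class gives that class a score of zero, which is the extreme case discussed in the experimental analysis.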
4 Experimental Analysis 
The classification standard used in the experiments is the set of 12 categories preset in
the professional classification catalogue of the Ministry of Education of the People's
Republic of China, namely philosophy, economics, law, pedagogy, literature, history,
science, engineering, agronomy, medicine, management and military science. The data sets
used in the experiment include isolet, covtype and census_. The data are described in
Table 2.
 
| Experiment No | Number of documents | Number of matrix rows | Number of matrix columns | Matrix size | Computing time |
|---|---|---|---|---|---|
| 1 | 72 | 72 | 29876 | 5.7 | 3 |
| 2 | 728 | 728 | 175897 | 332.6 | 14 |
| 3 | 7159 | 7159 | 746239 | 17.8 | 27 minutes and 100 seconds |
| 4 | 108026 | 108026 | 903452 | 198.6 | 7 hours 34 minutes 20 seconds |

Table 2: Description of algorithm experimental data
 
As for data sources, an analysis of different target sources showed that the websites of
well-known universities, some discipline portals and open-access (OA) repositories contain
a large number of publicly published academic documents that can be captured without
restriction. University websites, OA repositories and discipline portals were therefore
chosen as target sources. To make the results more representative, a conference website
and researchers' home pages were also added. The target sites selected in this experiment
are shown in Table 3.
 
| No. | Site | Brief introduction | Type |
|---|---|---|---|
| 1 | https://www.stanford.edu | Stanford University website | University website |
| 2 | https://www.omicsonline.org | OMICS Group website | OA repository |
| 3 | https://www.acm.org | Association for Computing Machinery (ACM) website | Subject portal |
| 4 | https://webis.de | PAN international conference website | Conference website |

Table 3: Document capture target sites
 
The experimental results show that the classification accuracy of naive Bayes is slightly
improved after discretization. The reason is that discretization maps continuous
attributes to discrete categorical attributes, which makes the system more complete and,
to a certain extent, avoids a potential problem in estimating the posterior probability
from training data: the extreme case in which the class conditional probability of one
attribute equals zero, so that the posterior probability of the whole class equals zero,
resulting in a classification error or an inability to classify. The experimental results
also show that the classification accuracy of the algorithm can be greatly improved by
discretizing the continuous data with the parallel attribute discretization algorithm
based on direct.
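To illustrate what discretization does to a continuous attribute, the sketch below uses simple equal-width binning. This is a stand-in technique chosen for brevity and is not the parallel discretization algorithm used in the experiments; the function name and sample values are assumptions.

```python
def equal_width_bins(values, num_bins):
    """Map continuous values to bin indices 0..num_bins-1 (equal-width)."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (high - low) / num_bins
    # The maximum value would fall into bin num_bins, so clamp it.
    return [min(int((v - low) / width), num_bins - 1) for v in values]


weights = [1.0, 2.0, 4.0, 8.0, 9.0]
print(equal_width_bins(weights, 4))
# [0, 0, 1, 3, 3]
```

After binning, each attribute takes only a few distinct values, so every value is more likely to appear with every class in the training set, and fewer class conditional probabilities are estimated as zero.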
In terms of execution efficiency, the running time of the two algorithms was recorded for
data classification tasks of different scales in environments with different numbers of
nodes. The running times are shown in Figure 6.
(Plot of the ordinary algorithm and the Naive Bayes algorithm; axes labeled "The literature number" and "Matrix is the number of columns".)
Figure 6: Comparison of algorithm running time 
 
As can be seen from Figure 6, the classification results of all academic documents can be
viewed through this system. The year is not mined from the text but taken directly from
the PDF file metadata (creation date). The document display page shows the title,
category, original URL and a text excerpt of each document. Every interface provides a
faceted search function to support secondary retrieval by users. The efficiency of the
algorithms in terms of running time is calculated and shown in Figure 7.
 
 
Figure 7: Comparative analysis of the algorithms in 
terms of efficiency 
 
The Naïve Bayes algorithm is considerably more efficient: according to Figure 7, its
efficiency in terms of running time is 82%, whereas that of the ordinary algorithm is 70%.
5 Conclusion
The design and implementation of the system based on the naive Bayesian classification
algorithm not only address the problems of large memory consumption, slow processing speed
and high feature vector dimension in massive document processing, but also enable
researchers to obtain and use the documents effectively. The system also addresses the
integration of two heterogeneous ontology libraries and their application in specific
fields, as well as the problem that the traditional word vector space, because of its high
dimension and lack of semantics, cannot meet people's needs for semantic classification,
semantic navigation and semantic retrieval of massive network information resources. The
work therefore has academic value and practical significance. The design idea and
framework of the system can be applied directly to e-government systems, portal websites,
vertical search engines, digital library websites and so on. The main strength of the
approach lies in its ability to classify web documents into the right categories correctly
and quickly. Future work could combine two classification techniques to increase the
accuracy of web page classification.
References
[1] Li, W. Q., Li, Y., Chen, J., & Hou, C. Y. (2017). Product functional information based
automatic patent classification: method and experimental studies. Information Systems,
67(JUL.), 71-82. https://doi.org/10.1016/j.is.2017.03.007
[2] Agnihotri, D., Verma, K., & Tripathi, P. (2018). An automatic classification of text
documents based on correlative association of words. Journal of Intelligent Information
Systems, 50(3), 549-572.
[3] Pajic, M. S., Veinovic, M., Peric, M., & Orlic, V. D. (2020). Modulation order
reduction method for improving the performance of amc algorithm based on sixth-order
cumulants. IEEE Access, PP(99), 1-1. DOI: 10.1109/ACCESS.2020.3000358
[4] Du, J. H. (2017). Automatic text classification algorithm based on gauss improved
convolutional neural network. Journal of Computational Science, 21(jul.), 195-200.
[5] Kumar, Y., Sheoran, M., Jajoo, G., & Yadav, S. K. (2020). Automatic modulation
classification based on constellation density using deep learning. IEEE Communications
Letters, PP(99), 1-1, DOI: 10.1109/LCOMM.2020.2980840
[6] Dias, K. L., Pongelupe, M. A., Caminhas, W. M., & Errico, L. D. (2019). An innovative
approach for real-time network traffic classification. Computer networks, 158(JUL.20),
143-157, https://doi.org/10.1016/j.comnet.2019.04.004
[7] Tajanpure, R., & Muddana, A. (2021). Circular convolution-based feature extraction
algorithm for classification of high-dimensional datasets. Journal of Intelligent
Systems, 30(1), 1026-1039, https://doi.org/10.1515/jisys-2020-0064
[8] Gaber, A., Hamdy, A., Abdelaal, H. M., Elkattan, A., & Youness, H. A. (2021). Automatic
classification algorithm for diffused liver diseases based on ultrasound images. IEEE
Access, PP(99), 1-1, DOI: 10.1109/ACCESS.2021.3049341.
[9] Lopes, F., Agnelo, J., Teixeira, C. A., Laranjeiro, N., & Bernardino, J. (2020).
Automating orthogonal defect classification using machine learning algorithms. Future
generation computer systems, 102(Jan.), 932-947, DOI: 10.1109/ACCESS.2021.3049341
[10] An, Y., Xu, M., & Shen, C. (2019). Classification method of teaching resources based
on improved knn algorithm. International Journal of Emerging Technologies in Learning
(iJET), 14(4), 73-88, https://doi.org/10.3991/ijet.v14i04.10131
[11] Rueda, C. A., & Ryan, J. P. (2020). Humpback whale song analysis based on automatic
classification performance. The Journal of the Acoustical Society of America, 148(4),
2597-2597, https://doi.org/10.1121/1.5147215
[12] El Kourdi, M., Bensaid, A., & Rachidi, T. E. (2004). 
Automatic Arabic document categorization based on 
the Naïve Bayes algorithm. In proceedings of the 
Workshop on Computational Approaches to Arabic 
Script-based Languages (pp. 51-58), 
https://dl.acm.org/doi/10.5555/1621804.1621819 
[13] Chen, H., & Fu, D. (2018, March). An improved 
Naive Bayes classifier for large scale text. In 2018 
2nd International Conference on Artificial 
Intelligence: Technologies and Applications (ICAITA 
2018) (pp. 33-36). Atlantis Press, 
https://doi.org/10.2991/icaita-18.2018.9 
[14] Wang, Y., Hodges, J., & Tang, B. (2003, 
November). Classification of web documents using a 
naive bayes method. In Proceedings. 15th IEEE 
International Conference on Tools with Artificial 
Intelligence (pp. 560-564). IEEE, 
DOI: 10.1109/TAI.2003.1250241 
[15] Adetunji, A. B., Oguntoye, J. P., Fenwa, O. D., & 
Akande, N. O. (2018). Web Document Classification 
Using Naïve Bayes. Journal of Advances in 
Mathematics and Computer Science, 29(6), 1-11, 
DOI: https://doi.org/10.48550/arXiv.2006.01715 
[16] Dai, Y., & Sun, H. (2014). The naive Bayes text 
classification algorithm based on rough set in the 
cloud platform. Journal of Chemical and 
Pharmaceutical Research, 6(7), 1636-1643, 
https://doi.org/10.1007/s00500-020-05410-9 
[17] Koopman, B. ,  Zuccon, G. ,  Nguyen, A. ,  
Bergheim, A. , &  Grayson, N. . (2015). Automatic 
icd-10 classification of cancers from free-text death 
certificates. International journal of medical 
informatics, 84(11), 956-965, 
DOI: 10.1016/j.ijmedinf.2015.08.004 
[18]  Li, K. , &  Sidorovskaia, N. . (2019). Detection and 
classification beaked whale vocalization calls based 
on unsupervised machine learning algorithm. The 
Journal of the Acoustical Society of America, 145(3), 
1855-1856. 
[19] Sharma, A., & Kumar, R. (2019). Computation of 
the reliable and quickest data path for healthcare 
services by using service-level agreements and 
energy constraints. Arabian Journal for Science and 
Engineering, 44(11), 9087-9104, 
DOI: 10.1007/s13369-019-03836 
[20] Harakawa, R., Ogawa, T., Haseyama, M., & 
Akamatsu, T. (2018). Automatic detection of fish 
sounds based on multi-stage classification including 
logistic regression via adaptive feature weighting. 
The Journal of the Acoustical Society of America, 
144(5), 2709-2718, DOI: 10.1121/1.5067373 
[21] Hartvigsen, L., Kongsted, A., Vach, W., Salmi, 
L. R., & Hestbaek, L. (2018). Does a diagnostic 
classification algorithm help to predict the course of 
low back pain? A study of Danish chiropractic patients 
with one-year follow up. Journal of Orthopaedic and 
Sports Physical Therapy, 48(11), 1-35, 
DOI: 10.2519/jospt.2018.8083 
[22] Sharma, A., & Kumar, R. (2019). Risk-energy aware 
service level agreement assessment for computing 
quickest path in computer networks. International 
Journal of Reliability and Safety, 13(1-2), 96-124. 
[23] Foroutan, M., & Zimbelman, J. R. (2017). Semi-
automatic mapping of linear-trending bedforms using 
'self-organizing maps' algorithm. Geomorphology, 
293(PT.A), 156-166. 
Heidari, M., Lakshmivarahan, S., Mirniaharikandehei, 
S., Danala, G., & Zheng, B. (2021). Applying a 
random projection algorithm to optimize machine 
learning model for breast lesion classification. IEEE 
Transactions on Biomedical Engineering, PP(99), 1-1, 
https://doi.org/10.1109/TBME.2021.3054248 
[24] Nardini, A., & Brierley, G. (2020). Automatic 
river planform identification by a logical-heuristic 
algorithm. Geomorphology, 375(1–2), 107558, 
https://doi.org/10.1016/j.geomorph.2020.107558 
[25] Yan, J., Lin, S., Kang, S. B., & Tang, X. 
(2015). Change-based image cropping with exclusion 
and compositional features. International Journal of 
Computer Vision, 114(1), 74-87, DOI: 
https://doi.org/10.1007/s11263-015-0801-5 
[26] Elsanadily, S., Mahran, A., & Elghandour, O. 
(2018). Classification-based algorithm for bit-flipping 
decoding of GLDPC codes over AWGN channels. IEEE 
Communications Letters, PP(99), 1-1, 
DOI: 10.1109/LCOMM.2018.2840146 
[27] Sharma, A., Kumar, R., & Bajaj, R. K. (2021). On 
Energy-constrained quickest path problem in green 
communication using intuitionistic trapezoidal fuzzy 
numbers. Recent Advances in Computer Science and 
Communications (Formerly: Recent Patents on 
Computer Science), 14(1), 192-200, DOI: 
https://doi.org/10.2174/2213275911666181025125224 
[28] Bahadure, N. B., Ray, A. K., & Thethi, H. P. 
(2018). Comparative approach of MRI-based brain 
tumor segmentation and classification using genetic 
algorithm. Journal of Digital Imaging, 31(1), 1-13, 
DOI: 10.1007/s10278-018-0050-6 
[29] Redzic, M., Laoudias, C., & Kyriakides, I. 
(2019). Image and WLAN bimodal integration for 
indoor user localization. IEEE Transactions on 
Mobile Computing, 19(99), 1109-1122, 
DOI: 10.1109/TMC.2019.2903044 
[30] Zhao, D., Liu, S., Yang, X., Ma, Y., & Chu, W. 
(2021). Research on camouflage recognition in 
simulated operational environment based on 
hyperspectral imaging technology. Journal of 
Spectroscopy, 2021(2), 1-9, DOI: 
https://doi.org/10.1155/2021/6629661 
[31] Poongodi, M., Sharma, A., Vijayakumar, V., 
Bhardwaj, V., Sharma, A. P., Iqbal, R., & Kumar, R. 
(2020). Prediction of the price of Ethereum 
blockchain cryptocurrency in an industrial finance 
system. Computers & Electrical Engineering, 81, 
106527, DOI: 
https://doi.org/10.1016/j.compeleceng.2019.106527 
[32] Ahmed, I., Ali, R., Guan, D., Lee, Y. K., Lee, S., 
& Chung, T. C. (2015). Semi-supervised learning 
using frequent itemset and ensemble learning for SMS 
classification. Expert Systems with Applications, 
42(3), 1065-1073, DOI: 
https://doi.org/10.1016/j.eswa.2014.08.054 
[33] Kumar, C., Singh, A. K., Kumar, P., Singh, R., & 
Singh, S. (2020). SPIHT‐based multiple image 
watermarking in NSCT domain. Concurrency and 
Computation: Practice and Experience, 32(1), e4912, 
DOI: https://doi.org/10.1002/cpe.4912 
[34] Dadaneh, B. Z., Markid, H. Y., & 
Zakerolhosseini, A. (2016). Unsupervised 
probabilistic feature selection using ant colony 
optimization. Expert Systems with Applications, 
53(Jul.), 27-42, 
https://doi.org/10.1016/j.eswa.2016.01.021 