Application of Distributed Web Crawlers in Information Management System

Bo Wen
School of Computer Science and Technology, Huaibei Normal University, Huaibei, 235000, China
E-mail: bowen1983@yeah.net

Technical paper

Keywords: web crawlers, Hadoop, information management system

Received: February 7, 2018

In the Internet era, cloud data and big data develop constantly, and the Internet has become the main platform on which enterprises and individuals release information. As a result, a large amount of data is generated, and people spend more and more energy on finding the information they want; the desire to acquire the needed information accurately has become increasingly strong. This study designed a distributed web crawler system based on Hadoop and applied it to large-scale information management. A simulation experiment verified that the system can operate stably in an information management system, which offers a reference for the application of distributed web crawlers in information management systems.

Povzetek: Razvit je distribuirani spletni preiskovalnik na osnovi Hadoopa za upravljanje informacij.

1 Introduction

The Internet has developed rapidly in the 21st century, accompanied by an exponential increase of the data volume on the Internet. With the diversification of information, information management has become more and more difficult. How to search for information through search engines in a timely and accurate way and how to manage that information have become crucial, and the requirements on the relevant technologies keep rising. With the development of computers, information management systems have emerged, and more efficient and simpler information management systems are being developed. Qin et al. [1] designed an SG-UAP development tool based on the Eclipse development environment and applicable to the Windows operating system; a database platform was developed based on Oracle to provide a Tomcat network information management service, and the system managed network information through a service-oriented architecture. Gupta et al. [2] analyzed management information services, proposed to manage network information with them, and found that a management information system could optimize network information and collect and manage data accurately. Zhao [3] established a topic-focused crawler based scientific research information system to improve the level of information management.

Web crawlers can capture webpage information from the network and extract and store the key information, which addresses the urgent problem of information acquisition. However, information collection based on web crawlers faces difficulties such as information repetition and the existence of dynamic pages; distributed technologies are therefore needed to solve these problems and enhance crawling efficiency. In the study of Su et al. [4], single-thread and multi-thread web crawlers were implemented in a distributed system to capture and store data with diversified and personalized operations, which enhanced the capturing speed. In the study of Zhang et al. [5], a Hadoop based distributed web crawler system was optimized; its parameters were tuned through an analysis of the factors influencing crawling efficiency. Distributed web crawlers have great advantages in collecting and storing information; hence they can help establish a practical and highly efficient information management system.
In this study, web crawlers were analyzed, and then a Hadoop based distributed web crawler system was designed to manage network information. The simulation experiment suggested that the system can effectively collect and store network information and surpass the performance of single-node web crawlers, which provides a reference for the application of distributed web crawlers in information management systems.

2 System related technologies

2.1 Web crawlers

A web crawler is a program that crawls web resources on the Internet according to certain rules and provides the obtained network information to a search engine; it is therefore an indispensable part of a search engine [6]. To achieve a high crawling ability, a web crawler should have the following five characteristics [7].

(1) High performance. A large amount of information involves a mass of Uniform Resource Locators (URLs). Distributed web crawlers should capture useful information from webpages in a timely and effective way; the more information is captured per unit time, the better the performance of the crawler.

(2) Expandability. Expandability is needed to achieve high crawler performance. It means that the crawler system as a whole is not affected when individual crawlers are being updated or performing other operations. Good expandability is also needed to crawl different sites efficiently, as the programming language and encoding differ between websites.

(3) Robustness. Facing a large number of servers, web crawlers may encounter emergencies such as crawler traps during their work, and handling these conditions reasonably is a mark of an excellent crawler. Only with good robustness can a crawler get back to work after an interruption; moreover, the previously crawled content should be restored after a restart.

(4) Friendliness. Web crawlers should respect the relevant information of websites as specified by the robots protocol, the crawling scope should be well defined, and additional burdens on websites should be avoided when crawlers capture information.

(5) Updatability. Web crawlers should be able to perceive changes to websites and acquire new website content in time to replace the old one.

An information management system needs to collect and store diversified data from the Internet. With the explosive growth of data, traditional stand-alone web crawlers have gradually become inadequate; hence stronger and more comprehensive information management systems are needed.

2.2 Hadoop

Hadoop, a basic distributed-system framework developed by the Apache Software Foundation, is composed of many ordinary, low-cost computers. It can process mass data rapidly and flexibly and has the following advantages. (1) High reliability: its data processing ability is highly reliable. (2) Strong fault tolerance: Hadoop automatically replicates multiple copies of data and reallocates failed tasks. (3) High scalability: Hadoop can process and allocate data across hundreds of servers and easily expand to thousands of nodes. (4) High efficiency: Hadoop can transfer data efficiently between different nodes. (5) Low cost: compared with commercial data warehouses, Hadoop is open-source.

Hadoop has two core parts. One is its distributed file system, the Hadoop Distributed File System (HDFS). HDFS is capable of storing very large files, for example files of more than 100 TB, features strong fault tolerance, and can operate on low-cost hardware. The other core is the MapReduce computational model, which can process mass data concurrently, has favorable extensibility and fault tolerance, and offers a huge advantage in data processing.
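To make the MapReduce model more concrete, the following is a minimal, self-contained sketch of a Hadoop MapReduce job in Java. It is an illustrative example only, not the code of the system described below: the HostCount class, its assumption of one URL per line of input, and the input/output paths are hypothetical. The job counts how many crawled URLs belong to each host, the kind of statistic a crawler can use when balancing load across websites.

import java.io.IOException;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Illustrative job: counts crawled URLs per host; input is assumed to be one URL per line. */
public class HostCount {

    public static class HostMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text host = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // Emit (host, 1) for every URL; malformed lines are simply skipped.
                host.set(new URL(value.toString().trim()).getHost());
                context.write(host, ONE);
            } catch (Exception e) {
                // ignore malformed URLs
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Emit (host, total number of crawled URLs from that host).
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "host count");
        job.setJarByClass(HostCount.class);
        job.setMapperClass(HostMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of URL lists
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged as a JAR and submitted with the hadoop jar command, reading its input from and writing its output to HDFS directories.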
2.3 Application values of distributed web crawlers in information management systems

In view of the advantages of distributed systems and the properties of web crawlers, a distributed web crawler is feasible. A distributed web crawler combines a web crawler with a distributed system and is capable of fulfilling different tasks by making the best use of information on the Internet. It effectively makes up for the defects of the stand-alone web crawler: it can cover more websites and collect and store more data. Therefore, a Hadoop based distributed web crawler has high application value in an information management system.

3 Design of information management system

3.1 Design of distributed web crawler system architecture

3.1.1 Design of physical architecture

To satisfy the aforementioned characteristics, the cost of PC servers should be kept low, and the Hadoop based distributed architecture should be extensible [8]. The system should distribute the crawled page data over different nodes using its distributed storage capacity. Strong fault tolerance is also needed to set the number of data copies and to reallocate failed tasks to other nodes. The distributed architecture can enhance the overall performance of the crawlers to a large extent.

The physical architecture of the web crawlers in this study included a Hadoop cluster and a Storm cluster [9]. To reduce the pressure on the Hadoop cluster during operation, separate deployment was adopted. Crawling was divided into multiple tasks and run on multiple Slave nodes based on the distributed storage and computation abilities of the architecture, and the collected data were stored in the cluster. The data generated while crawlers crawled and analyzed webpages were then written into Kafka, and Storm was used to calculate index results in real time. The physical architecture is shown in Figure 1.

Figure 1: Design of the physical architecture of distributed web crawlers.

3.1.2 Design of logic structure

The logic structure of the distributed web crawlers is shown in Figure 2. It included a batch processing part and a real-time calculation part. Batch processing was realized mainly on the Hadoop platform and was responsible for carrying out crawling tasks and storing data in HBase. Real-time calculation was realized on the Storm platform and was responsible for processing the data generated during system operation and storing the results in Redis.

Figure 2: Design of the logic architecture of distributed web crawlers.

3.2 Modules of distributed web crawlers

The distributed web crawler was composed of the following modules. (1) URL splitting and injection module: reads the user's URL path, obtains the URL list, splits it into several parts and allocates them to TaskTrackers. (2) Webpage access module: acquires webpages according to URL links, downloads them and saves them locally. (3) Webpage analysis module: analyzes the structure and content of the captured webpages. (4) Link filtering module: filters the acquired URLs and eliminates invalid and repeated links. (5) Data storage module: saves the data in the database on HDFS.

3.3 Design of key technology

3.3.1 URL standardization

A URL is a character string that identifies an information resource on the World Wide Web, and each information resource has one and only one URL [10]. URL standardization means transforming a URL into a qualified equivalent URL. The transformation is realized by resolving path segments such as /xx/../, /../ and /./ to / and by collapsing duplicate slashes such as xx//yy into xx/yy.
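As an illustration of these normalization rules, the following small Java sketch (a hypothetical helper written for this description, not the system's actual code) collapses duplicate slashes and resolves "." and ".." path segments to obtain a qualified equivalent URL; it additionally lowercases the scheme and host and drops the fragment, two further normalizations commonly applied in practice.

import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper illustrating the URL standardization rules of Section 3.3.1. */
public final class UrlNormalizer {

    /** Returns a canonical equivalent of the given URL, or null if it cannot be handled. */
    public static String normalize(String url) {
        try {
            URI uri = new URI(url.trim());
            if (uri.getScheme() == null || uri.getAuthority() == null) {
                return null;                       // this sketch only handles absolute URLs
            }
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            path = path.replaceAll("/{2,}", "/");  // collapse xx//yy into xx/yy
            // URI.normalize() resolves "/./" and "/xx/../" segments, e.g. /a/./b/../c -> /a/c
            return new URI(uri.getScheme().toLowerCase(),
                           uri.getAuthority().toLowerCase(),
                           path, uri.getQuery(), null)   // the fragment is dropped
                    .normalize().toString();
        } catch (URISyntaxException e) {
            return null;                           // unparsable URLs are filtered out
        }
    }

    public static void main(String[] args) {
        // Example: the cluttered form normalizes to a clean equivalent URL.
        System.out.println(normalize("http://Example.com//a/./b/../index.html"));
        // -> http://example.com/a/index.html
    }
}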
3.3.2 Allocation of crawler tasks

Before crawling on the distributed architecture, tasks have to be allocated to the distributed cluster [11], and when a node fails its tasks have to be reallocated. For a Hadoop cluster with N nodes, URLs were selected from the URL set, the top-ranked URLs were divided into N subsets, and the subsets were allocated to the different nodes of the Hadoop cluster to carry out the crawling tasks. If a node failed, the Master would allocate the failed task to other nodes without affecting the crawling speed of the remaining nodes. The network pressure on the crawled websites should also be considered during crawling.

3.3.3 Balance politeness

Crawling of the same website should follow the principle of balance politeness [12]. URLs were ranked according to scoring rules; URLs were then taken out of the URL set one by one and allocated to N subsets, limiting both the number of URLs in one subset and the number of URLs from the same host in one subset. In this way, the pressure on a website could be kept low while web crawlers were crawling its pages.

3.3.4 Webpage revisit

The web is highly dynamic: by the time a crawling task has been completed, the webpages may already have changed. Therefore web crawlers should revisit websites at certain time intervals and update the content that needs to be crawled.

3.3.5 Data deduplication

There are many identical data on the network, so the collected data should be deduplicated. (1) The webpage content was segmented into words, i.e. characteristic vectors, and the occurrence frequency of every word in the document was taken as its weight. (2) The hash value of every characteristic vector was calculated [13], and the hash bits were accumulated with these weights. (3) Accumulated results larger than 0 were denoted as 1 and otherwise as 0; the final results are the Simhash signature values [14]. (4) The similarity of the data was determined by comparing the Simhash signature values.
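The four steps above can be illustrated with a short sketch. The following Java class is a simplified re-implementation under stated assumptions (64-bit signatures, term frequency as the feature weight, and an FNV-1a string hash), not the system's actual code; step 4 compares two signatures by their Hamming distance, since near-duplicate pages yield signatures that differ in only a few bits.

import java.util.HashMap;
import java.util.Map;

/** Illustrative 64-bit Simhash for near-duplicate detection (Section 3.3.5). */
public final class Simhash {

    /** Computes a 64-bit Simhash signature of the given text. */
    public static long signature(String text) {
        // Step 1: split content into word features and use term frequency as weight.
        Map<String, Integer> weights = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                weights.merge(word, 1, Integer::sum);
            }
        }
        // Step 2: hash every feature and accumulate its weight per bit position.
        long[] acc = new long[64];
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            long h = hash64(e.getKey());
            for (int i = 0; i < 64; i++) {
                acc[i] += ((h >>> i) & 1L) == 1L ? e.getValue() : -e.getValue();
            }
        }
        // Step 3: bits with positive accumulated weight become 1, others 0.
        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            if (acc[i] > 0) {
                sig |= 1L << i;
            }
        }
        return sig;
    }

    /** Step 4: similar documents have signatures with a small Hamming distance. */
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Simple 64-bit string hash (FNV-1a); a real system would use a stronger hash. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        long a = signature("distributed web crawler based on hadoop");
        long b = signature("a distributed web crawler based on hadoop");
        // Near-duplicate pages typically differ in only a few of the 64 bits.
        System.out.println("Hamming distance: " + hammingDistance(a, b));
    }
}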
4 Concrete implementation of distributed crawler

The URL initialization module was combined with a parallel loop model to realize the procedures of URL injection, crawl list generation, webpage crawling and data updating in the data crawling of the distributed crawler. The modules form a loop: links in the link library are updated, a crawl list is generated, URL crawling is executed, key information is analyzed, and the link library is updated again. This module composition and loop flow benefit the concrete implementation of the distributed crawler. The implementation flow is shown in Figure 3.

Figure 3: The implementation flow of the distributed crawler.

5 System test and results analysis

Before testing, the test environment had to be prepared. VMware Workstation was installed on a Windows host, and a Hadoop cluster was established in the virtual machines. For the development of the system, Java code was written with the Eclipse IDE on the host; the Hadoop Eclipse plug-in was installed and connected to the Hadoop cluster. Data were processed using the Hadoop Distributed File System (HDFS) and the MapReduce calculation model.

5.1 Functional test

5.1.1 Test content and scheme

The functional test included the following content.

(1) Webpage crawling test. Initial URL sets with 0, 1 and 4 URL seeds were used, and three conditions were considered: effective crawling, partially effective crawling and ineffective crawling. After crawling, it was checked whether the downloaded target data satisfied the requirements.

(2) URL link filter test. The log of URL links waiting to be crawled was checked to determine whether link standardization and deduplication had been performed.

(3) Webpage data extraction test. It was checked whether the analysis module worked correctly and could effectively extract data from webpages and store the data in the relevant files.

(4) Webpage category classification test. The system classified webpages into different categories, and it was checked whether the classification was correct.

5.1.2 Test results

The system crawled webpages according to the prescribed initial URL set and added the crawled URLs to the set of URLs to be crawled, performing standardization and deduplication before addition. The extracted data were stored in the relevant files. Moreover, the system could classify webpages rapidly.

5.2 Performance test

5.2.1 Test content and scheme

(1) Test on collection scale. After a period of webpage crawling, the size of the collected webpage data was measured to determine the collection scale.

(2) Test on operation speed. During crawling, the size x of the collected web data was measured after n hours of operation. The crawling speed v was computed as v = x/n.

5.2.2 Test results

Table 1 shows the data collection speed of the cluster based on four nodes. The operation of the system included webpage downloading, webpage analysis, extraction of record information from the network and classification of web text. According to the data in Table 1, the system basically satisfied the requirements.

Table 1: The operation results of the information management system.

Number  Segment name            Size (MB)  Operation time (h)
1       Segment20171002093417       39.21                 0.6
2       Segment2017100213672        82.61                 1.1
3       Segment2017100360349       180.44                 2.4
4       Segment2017100413725       305.14                 4.5
5       Segment2017100547436       400.62                 5.7
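As a worked example of the speed formula v = x/n applied to the measurements in Table 1, the five segments yield approximately 39.21/0.6 ≈ 65.4, 82.61/1.1 ≈ 75.1, 180.44/2.4 ≈ 75.2, 305.14/4.5 ≈ 67.8 and 400.62/5.7 ≈ 70.3 MB/h, i.e. the four-node cluster collected roughly 65-75 MB of webpage data per hour.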
5.3 Test on expandability

5.3.1 Test content and scheme

In the expandability test, the number of nodes on the Hadoop platform was changed, and the test was performed with 1, 2 and 3 coordinated nodes to determine whether the operation remained normal and how the performance of the system was affected.

5.3.2 Test results

Figure 4 shows the data collection and analysis of the system for different times and numbers of nodes. It can be noted that the per-node speed was highest when there was only one node; the overall operation speed improved remarkably as the number of nodes increased, while the speed of each individual node showed no significant change. The test showed that the expandability satisfies the predetermined requirements.

Figure 4: The test results of system expandability.

The Internet plays an increasingly important role in people's production and life and has become the main source of information. Distributed web crawlers can grab key data from mass data, which is greatly helpful for information acquisition. Bal et al. [15] put forward an intelligent distributed crawler based on a client-server architecture in which the load is managed by the server; every time a crawler is loaded, URLs are dynamically allocated to distribute the load, which enhances the crawling ability. Kumar et al. [16] developed a distributed semantic web crawler that successfully crawled and utilized HTML and semantic web documents written in OWL/RDF. In an information management system, distributed web crawlers can give full play to these advantages because they can effectively crawl the needed information from mass data and collect and manage it efficiently. The application of distributed web crawlers can achieve efficient and safe management of information and has high practicability.

6 Conclusion

In conclusion, the distributed web crawler based information management system could satisfy the requirements of web crawling precisely, with high performance and expandability. Moreover, it can effectively reduce repeated visits to and downloads of resources and thereby improve the efficiency of information searching, and it can reduce the time and money spent on resource acquisition because of its low cost. Therefore it can be applied to extracting network information. This work provides a reference for the application of distributed web crawler based information management systems in data extraction.

7 References

[1] Qin Y., Xuan H., Zhang B. (2016). Intelligent Management System of Power Network Information Collection Under Big Data Storage. 13th Global Congress on Manufacturing and Management (GCMM 2016), MATEC Web of Conferences, Zhengzhou.

[2] Gupta C. L. P., Sharma S., Tripathi S. (2015). Importance of Management Information System in Electronic-Information Era. East Carolina University, 1(2).

[3] Zhao Q. A. (2016). Research and Implementation of Scientific Research Information Management System Based on the Topic Web Crawler. Anhui: Anhui University, pp. 1-46.

[4] Su L., Wang F. (2017). Web crawler model of fetching data speedily based on Hadoop distributed system. IEEE International Conference on Software Engineering and Service Science, Beijing, pp. 927-931.

[5] Zhang X., Xian M. (2015). Optimization of Distributed Crawler under Hadoop. International Conference on Engineering Technology and Application, 22:02029.

[6] Qu X., Hu R., Zhou L., Wang L., Zhu Q. (2015). Expert Achievements Model for Scientific and Technological Based on Association Mining. International Symposium on Distributed Computing and Applications for Business Engineering and Science, Guiyang, pp. 272-275.

[7] Bahrami M., Singhal M., Zhuang Z. (2015). A cloud-based web crawler architecture. International Conference on Intelligence in Next Generation Networks, Paris, pp. 2016-223.

[8] Pu Q. (2016). The Design and Implementation of a High-Efficiency Distributed Web Crawler. Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress, Auckland, pp. 100-104.

[9] Kim M., Han S., Cui Y., Lee H., Cho H., Hwang S. (2014). CloudDMSS: robust Hadoop-based multimedia streaming service architecture for a cloud computing environment. Cluster Computing, 17(3): 605-628.

[10] Bhagyashree E., Tanuja K. (2015). Phishing URL Detection: A Machine Learning and Web Mining-based Approach. International Journal of Computer Applications, 123.

[11] Santhosh K. D. K., Kamath M. (2014). Design and implementation of competent web crawler and indexer using web services.
International Conference on Advanced Communication Control and Computing Technologies, Ramanathapuram, pp. 1672-1677.

[12] Dąbek Osb T. M. (2012). Strengthen the faith as the task of the Pastors of the Church. The Apostles Peter and Paul as examples for the Pastors of the Church. Scriptura Sacra, (16): 19.

[13] Dong C. (2015). Asymmetric color image encryption scheme using discrete-time map and hash value. Optik - International Journal for Light and Electron Optics, 126(20): 2571-2575.

[14] Qiao Y., Yun X., Zhang Y. (2016). Fast Reused Function Retrieval Method Based on Simhash and Inverted Index. Trustcom/BigDataSE/ISPA, Tianjin, pp. 937-944.

[15] Bal S. K., Geetha G. (2016). Smart distributed web crawler. International Conference on Information Communication and Embedded Systems, Chennai, pp. 1-5.

[16] Kumar N., Singh M. (2016). Framework for Distributed Semantic Web Crawler. International Conference on Computational Intelligence and Communication Networks, Jabalpur, pp. 1403-1407.