Informatica 37 (2013) 381-386

Mining Web Logs to Identify Search Engine Behaviour at Websites

Jeeva Jose
Department of Computer Applications, BPC College, Mulakkulam North P.O, Piravom-686664, Ernakulam District, Kerala, India
E-mail: vijojeeva@yahoo.co.in

P. Sojan Lal
School of Computer Sciences, Mahatma Gandhi University, Kottayam, Kerala, India
E-mail: padikkakudy@gmail.com

Keywords: web logs, web usage mining, search engines, crawler

Received: April 3, 2013

Web Usage Mining, also known as Web Log Mining, is the extraction of user behaviour from web log data. Log files also provide a great deal of information about the search engine traffic at a website. This traffic is helpful for analysing the ethics of search engines, the quality of the crawlers, the periodicity of their visits and the server load. Search engine crawlers are automated programs which periodically visit a website to update information. Crawlers are the main components of a search engine; without them, websites would not be listed in the search results. The visibility of a web site depends on the quality of the crawlers. Different search engines may behave differently at web sites. We examine the differences in the behaviour of search engines in terms of the number of visits and the number of pages crawled. The hypothesis was tested and a significant difference in the behaviour of search engines was found.

Summary: The behaviour of various web search algorithms is analysed.

1 Introduction

Web Usage Mining is the extraction of information from the web log files generated when users visit a website [1]. Web mining tasks include mining web search engine data, analysing the web's link structures, classifying web documents automatically, mining web page semantic structures and page contents, mining web dynamics (mining log files), and building a multilayered and multidimensional web. Web log data is usually mined to study user behaviour at websites.
It also contains a great deal of information about search engine traffic. User traffic is removed during preprocessing, because it may otherwise bias the analysis of search engine behaviour. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of the engine. Identifying web crawlers is important because they can generate 90% of the traffic on websites [2].

Commercial search engines play a vital role in accessing web sites and in wider information dissemination [3, 4]. Search engines use automated programs called web crawlers to collect information from the web. These web crawlers are also known as spiders, bots or robots. They are highly automated and seldom regulated manually [5, 6, 7]. Crawlers periodically visit websites to update their content. Certain web sites, such as stock market sites or online news sites, may need frequent crawling to keep the search engine repositories up to date.

Web crawlers access websites for diverse purposes, which may include security violations. Hence they may raise ethical issues concerning privacy, security and the blocking of server access. Crawling activities are regulated from the server side with the help of the Robots Exclusion Protocol. The rules of this protocol are specified in a file called robots.txt. Ethical crawlers usually access this file first, which is located in the root directory of the website, and follow the rules it specifies [8, 9]. It is, however, also possible to crawl the pages of a website without accessing robots.txt. Certain crawlers seem to disobey the rules in robots.txt after the file has been modified, because crawlers such as "Googlebot", "Yahoo! Slurp" and "MSNbot" cache the robots.txt file for a website [8].

The web site monitoring software Google Analytics does not track crawlers or bots. This is because Google Analytics tracking is activated by a JavaScript snippet that is placed on every page of the website.
Crawlers rarely recognize these scripts, and hence visits from search engines are not recorded. In this work we examine whether all search engines behave in the same way when they access a website.

The most widely used log file formats are the Common Log File Format and the Extended Log File Format. The Common Log File Format contains the following information: a) the user's IP address, b) the user's authentication name, c) the date-time stamp of the access, d) the HTTP request, e) the URL requested, f) the response status and g) the size of the requested file. The Extended Log File Format contains additional fields: a) the referrer URL, b) the browser and its version and c) the operating system [11, 12]. HTTP requests are usually of three kinds, namely GET, POST and HEAD. Most HTML files are served via the GET method, while most CGI functionality is served via POST or HEAD. Status code 200 indicates a successful request. Just as users access a website through a browser, search engines deploy user agents to access the web.

2 Background literature

Most of the work in Web Usage Mining relates to user behaviour. This is because websites such as e-commerce sites are interested in studying user behaviour for marketing, online sales and personalization. Several data mining tasks, such as clustering, classification and association rule mining, have been applied to web log data on user behaviour. Web crawler ethics have been measured to assess the ethicality of commercial search engine crawlers [9]. A survey of the use of the Robots Exclusion Protocol on the web has been carried out through statistical analysis of a large sample of robots.txt files [10]. An empirical pilot study on the relationship between JavaScript usage and web site visibility was carried out to identify whether JavaScript-based hyperlinks attract or repel crawlers, resulting in an increase or decrease in web site visibility [6].
Another study examined commercial search engines to find whether there is a significant difference in their coverage of commercial web sites [4]. A report on search engine ratings in the United States is also available [3].

2.1 Preprocessing

Two data sets were extracted. Data set 1 consists of 529,175 records covering 8 weeks and data set 2 consists of 260,775 records. Entries with the unsuccessful status code 400 were eliminated. HTTP requests using POST and HEAD were also removed. In addition, all user requests were removed so that only search engine requests remained. This is required because user requests in the input file may bias the results on search engine behaviour. After preprocessing, the resultant file contained only the successful search engine requests.

Various search engine crawlers were identified. Some crawlers were identified from the IP address field, which contained substrings like "googlebot", "baiduspider" and "msnbot". The user agents were also helpful in identifying bots or crawlers such as Ezooms and discobot. Search engine crawlers with fewer than 5 visits per week were removed, as they were considered irrelevant. Ahrefbot, Seexie.com_bot, Turnitinbot and YrSpider were some of the bots in data set 1 with fewer than 5 visits in a week. For data set 2, Alexabot was considered irrelevant. The crawlers Baiduspider, Discobot, Exabot, Feedfetcher-Google, Feedseeker, Gosospider, Ichiro, Magpie, MJ12bot, MSNbot, Seexie.com_bot, Slurp, Sogou, Sosospider, SpBot, Turnitinbot, Yahoo, Yeti, Yodao, Youdao and YrSpider, which appeared in data set 1, were not present in data set 2. After preprocessing there were 22 crawlers for data set 1 and 5 crawlers for data set 2. The number of visits made by the various search engines is given in Table 1 for data set 1 and in Table 2 for data set 2.
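The preprocessing steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: the log lines are invented samples in the extended (combined) log format, the list of crawler substrings is a small hypothetical subset, and only status 200 is treated as successful. All function and variable names are illustrative.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "request" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical subset of substrings used to recognise crawlers.
CRAWLER_TOKENS = ("googlebot", "bingbot", "baiduspider", "msnbot", "ezooms", "discobot")


def identify_crawler(host, agent):
    """Return the first crawler token found in the host or user agent, else None."""
    text = f"{host} {agent}".lower()
    for token in CRAWLER_TOKENS:
        if token in text:
            return token
    return None


def crawler_requests(lines):
    """Keep successful GET requests made by recognised crawlers as (crawler, url) pairs."""
    kept = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line
        if match["method"] != "GET" or match["status"] != "200":
            continue  # drop POST/HEAD and unsuccessful requests
        crawler = identify_crawler(match["host"], match["agent"] or "")
        if crawler is not None:  # requests from ordinary users are discarded
            kept.append((crawler, match["url"]))
    return kept


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    'crawl-66-249-66-1.googlebot.com - - [03/Apr/2013:10:00:00 +0530] '
    '"GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '10.0.0.7 - - [03/Apr/2013:10:00:05 +0530] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/20.0"',
    '157.55.39.1 - - [03/Apr/2013:10:01:00 +0530] '
    '"GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '157.55.39.1 - - [03/Apr/2013:10:01:02 +0530] '
    '"HEAD /about.html HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    'crawl-66-249-66-1.googlebot.com - - [03/Apr/2013:10:02:00 +0530] '
    '"GET /missing.html HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

pages_per_crawler = Counter(crawler for crawler, _ in crawler_requests(SAMPLE_LOG))
print(pages_per_crawler)
```

Counting the retained requests per crawler and per week would give a pages-crawled tally of the kind reported in the tables. How requests were grouped into visits is not stated in the text, so the sketch does not attempt it.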
We also examine the number of pages crawled by the various search engines in order to observe their dynamic behaviour. With a few exceptions, most of the search engines accessed the robots.txt file before crawling other pages. Certain search engines crawled more pages than other bots or crawlers. For example, crawlers such as Googlebot, Slurp, Bingbot, Feedfetcher-Google and MJ12bot crawled a larger number of pages and behaved consistently. Table 3 shows the number of pages crawled by the various search engines for data set 1 and Table 4 shows the result for data set 2.

2.2 Kruskal-Wallis H test

The Kruskal-Wallis H test detects whether or not several groups of data belong to the same population [13, 14]. It is a non-parametric test suitable for distributions that are not normal, such as the exponential distributions observed in web usage mining or web log analysis [15]. The formula for the H statistic of the Kruskal-Wallis test is given below, where $K$ is the number of samples.

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} \frac{R_j^{2}}{n_j} - 3(N+1) \tag{1}$$

Here $R_j$ is the sum of the ranks of sample $j$, $n_j$ is the size of sample $j$, $j = 1, 2, 3, \dots, K$, and $N$ is the size of the pooled sample ($N = n_1 + n_2 + \dots + n_K$). The calculated H value is compared against the chi-square value with $(K-1)$ degrees of freedom at the given significance level $\alpha$.

Case I

H0: There is no significant difference between the number of visits made by various search engine crawlers.
H1: There is a significant difference between the number of visits made by various search engine crawlers.

From the test statistic in Table 5, both data sets show clear evidence for rejecting the null hypothesis. For data set 1 the p-value provides strong evidence for rejecting the null hypothesis, and for data set 2 it provides moderate evidence. The result of the H test shows that there is a significant difference in the number of visits made by various search engines.
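As an illustration of equation (1), the following pure-Python sketch computes the H statistic from grouped counts. It is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the optional correction for tied ranks is a standard adjustment that equation (1) does not show. The sample data are the weekly visit counts for data set 2 as listed in Table 2.

```python
from collections import Counter


def average_ranks(values):
    """Map each distinct value to its average 1-based rank in the pooled sample."""
    ordered = sorted(values)
    ranks = {}
    position = 0
    while position < len(ordered):
        end = position
        while end < len(ordered) and ordered[end] == ordered[position]:
            end += 1
        # Tied values occupy ranks position+1 .. end; assign them the average.
        ranks[ordered[position]] = (position + 1 + end) / 2
        position = end
    return ranks


def kruskal_wallis_h(groups, correct_ties=True):
    """Kruskal-Wallis H statistic for a list of samples, following equation (1)."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(sum(ranks[v] for v in group) ** 2 / len(group) for group in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    if correct_ties:
        tie_sizes = Counter(pooled).values()
        correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
        if correction == 0:
            raise ValueError("all observations are identical; H is undefined")
        h /= correction
    return h


# Weekly visits per crawler for data set 2 (Table 2).
VISITS_DATA_SET_2 = {
    "Ahrefsbot": [79, 0, 1, 19, 37, 66, 31, 48],
    "Bingbot": [31, 41, 27, 43, 23, 30, 28, 17],
    "Ezooms": [3, 20, 26, 38, 26, 24, 9, 28],
    "Googlebot": [42, 49, 42, 44, 42, 49, 35, 60],
    "Yandex": [35, 10, 67, 88, 6, 7, 3, 12],
}

h_visits = kruskal_wallis_h(list(VISITS_DATA_SET_2.values()))
degrees_of_freedom = len(VISITS_DATA_SET_2) - 1
print(f"H = {h_visits:.3f}, df = {degrees_of_freedom}")
```

With the tie correction applied, the statistic for these counts comes to about 9.799 with 4 degrees of freedom, which agrees with the chi-square value listed for data set 2 in Table 5. For 4 degrees of freedom the tabulated chi-square critical values are 9.488 at the 0.05 level and 13.277 at the 0.01 level, so this statistic lies between the two.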
Case II

H0: There is no significant difference between the number of pages crawled by various search engine crawlers.
H1: There is a significant difference between the number of pages crawled by various search engine crawlers.

The test statistic in Table 6 also shows that there is a significant difference in the number of pages crawled by various search engines. The p-values for both data sets provide strong evidence for rejecting the null hypothesis.

Table 1: No. of visits by various crawlers for data set 1.

| No. | Crawler | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Total | Mean | Std. dev. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Alexa | 1 | 5 | 10 | 1 | 2 | 0 | 2 | 3 | 24 | 3.00 | 3.207 |
| 2 | Baiduspider | 128 | 222 | 65 | 89 | 124 | 67 | 66 | 47 | 808 | 101.00 | 56.87 |
| 3 | Bingbot | 157 | 166 | 159 | 175 | 126 | 100 | 118 | 96 | 1097 | 137.13 | 30.94 |
| 4 | Discobot | 113 | 33 | 0 | 21 | 24 | 52 | 5 | 69 | 317 | 39.63 | 37.42 |
| 5 | Exabot | 1 | 1 | 2 | 1 | 5 | 3 | 3 | 3 | 19 | 2.38 | 1.408 |
| 6 | Ezooms | 50 | 48 | 40 | 22 | 0 | 23 | 38 | 41 | 262 | 32.75 | 16.74 |
| 7 | Feedfetcher-Google | 179 | 170 | 167 | 223 | 192 | 191 | 187 | 188 | 1497 | 187.13 | 17.28 |
| 8 | Googlebot | 211 | 226 | 238 | 273 | 212 | 207 | 200 | 207 | 1774 | 221.75 | 23.99 |
| 9 | Gosospider | 26 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 37 | 4.63 | 9.303 |
| 10 | Ichiro | 117 | 81 | 122 | 146 | 0 | 42 | 21 | 33 | 562 | 70.25 | 53.8 |
| 11 | Magpie | 20 | 17 | 13 | 15 | 13 | 15 | 14 | 18 | 125 | 15.63 | 2.504 |
| 12 | MJ12bot | 38 | 36 | 37 | 50 | 37 | 37 | 37 | 41 | 313 | 39.13 | 4.643 |
| 13 | MSNbot | 24 | 17 | 11 | 19 | 15 | 12 | 18 | 15 | 131 | 16.38 | 4.138 |
| 14 | Slurp | 149 | 114 | 144 | 190 | 144 | 145 | 160 | 145 | 1191 | 148.88 | 21.07 |
| 15 | Sogou | 48 | 34 | 37 | 54 | 40 | 44 | 43 | 60 | 360 | 45.00 | 8.701 |
| 16 | Sosospider | 28 | 31 | 42 | 38 | 31 | 32 | 30 | 28 | 260 | 32.50 | 4.957 |
| 17 | SpBot | 3 | 3 | 3 | 4 | 2 | 2 | 1 | 1 | 19 | 2.38 | 1.061 |
| 18 | Yandex | 51 | 71 | 57 | 72 | 102 | 44 | 51 | 74 | 522 | 65.25 | 18.64 |
| 19 | Yahoo | 22 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 24 | 3.00 | 7.69 |
| 20 | Yeti | 3 | 4 | 1 | 4 | 3 | 2 | 4 | 4 | 25 | 3.13 | 1.126 |
| 21 | Yodao | 16 | 59 | 26 | 100 | 72 | 42 | 10 | 32 | 357 | 44.63 | 30.6 |
| 22 | Youdao | 2 | 4 | 1 | 1 | 18 | 1 | 3 | 0 | 30 | 3.75 | 5.898 |

Table 2: No. of visits by various crawlers for data set 2.

| No. | Crawler | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Total | Mean | Std. dev. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ahrefsbot | 79 | 0 | 1 | 19 | 37 | 66 | 31 | 48 | 281 | 35.13 | 28.6 |
| 2 | Bingbot | 31 | 41 | 27 | 43 | 23 | 30 | 28 | 17 | 240 | 30 | 8.64 |
| 3 | Ezooms | 3 | 20 | 26 | 38 | 26 | 24 | 9 | 28 | 174 | 21.75 | 11.1 |
| 4 | Googlebot | 42 | 49 | 42 | 44 | 42 | 49 | 35 | 60 | 363 | 45.38 | 7.41 |
| 5 | Yandex | 35 | 10 | 67 | 88 | 6 | 7 | 3 | 12 | 228 | 28.5 | 32.3 |

Figure 1: Time series sequence plot for data set 1.

Table 3: No. of pages crawled by various crawlers for data set 1.

| No. | Crawler | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Total | Mean | Std. dev. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Alexa | 2 | 13 | 27 | 2 | 4 | 0 | 4 | 4 | 56 | 7.00 | 8.96 |
| 2 | Baiduspider | 219 | 674 | 102 | 124 | 260 | 98 | 94 | 90 | 1661 | 207.63 | 199.03 |
| 3 | Bingbot | 368 | 559 | 519 | 526 | 404 | 232 | 287 | 647 | 3542 | 442.75 | 143.30 |
| 4 | Discobot | 889 | 161 | 0 | 119 | 92 | 289 | 6 | 178 | 1734 | 216.75 | 287.42 |
| 5 | Exabot | 2 | 11 | 4 | 2 | 11 | 6 | 5 | 6 | 47 | 5.88 | 3.52 |
| 6 | Ezooms | 235 | 160 | 77 | 57 | 65 | 59 | 83 | 67 | 803 | 100.38 | 63.79 |
| 7 | Feedfetcher-Google | 386 | 343 | 340 | 493 | 442 | 447 | 443 | 417 | 3311 | 413.88 | 53.81 |
| 8 | Googlebot | 841 | 895 | 682 | 847 | 655 | 525 | 540 | 556 | 5541 | 692.63 | 150.42 |
| 9 | Gosospider | 34 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 46 | 5.75 | 12.03 |
| 10 | Ichiro | 230 | 277 | 387 | 414 | 320 | 234 | 45 | 291 | 2198 | 274.75 | 113.86 |
| 11 | Magpie | 23 | 21 | 18 | 23 | 16 | 16 | 18 | 22 | 157 | 19.63 | 2.97 |
| 12 | MJ12bot | 174 | 304 | 224 | 392 | 255 | 285 | 294 | 316 | 2244 | 280.50 | 65.06 |
| 13 | MSNbot | 31 | 24 | 13 | 28 | 17 | 15 | 18 | 18 | 164 | 20.50 | 6.44 |
| 14 | Slurp | 367 | 253 | 297 | 410 | 310 | 264 | 308 | 331 | 2540 | 317.50 | 51.79 |
| 15 | Sogou | 72 | 42 | 47 | 61 | 52 | 54 | 51 | 80 | 459 | 57.38 | 12.89 |
| 16 | Sosospider | 32 | 38 | 57 | 42 | 36 | 36 | 35 | 33 | 309 | 38.63 | 8.03 |
| 17 | SpBot | 6 | 6 | 6 | 8 | 4 | 4 | 2 | 2 | 38 | 4.75 | 2.12 |
| 18 | Yandex | 140 | 250 | 99 | 171 | 216 | 102 | 212 | 276 | 1466 | 183.25 | 66.20 |
| 19 | Yahoo | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 2.75 | 7.78 |
| 20 | Yeti | 6 | 9 | 2 | 7 | 7 | 4 | 7 | 7 | 49 | 6.13 | 2.17 |
| 21 | Yodao | 16 | 59 | 27 | 102 | 75 | 43 | 10 | 34 | 366 | 45.75 | 31.29 |
| 22 | Youdao | 4 | 8 | 2 | 2 | 25 | 2 | 7 | 2 | 52 | 6.50 | 7.86 |

Table 4: No. of pages crawled by various crawlers for data set 2.

| No. | Crawler | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Week 8 | Total | Mean | Std. dev. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ahrefsbot | 282 | 0 | 1 | 19 | 108 | 119 | 46 | 74 | 649 | 81.13 | 93.08 |
| 2 | Bingbot | 66 | 172 | 158 | 251 | 102 | 90 | 78 | 48 | 965 | 120.63 | 68.03 |
| 3 | Ezooms | 3 | 23 | 35 | 51 | 32 | 36 | 9 | 40 | 229 | 28.63 | 16.08 |
| 4 | Googlebot | 74 | 92 | 83 | 99 | 90 | 95 | 65 | 83 | 681 | 85.13 | 11.33 |
| 5 | Yandex | 39 | 18 | 123 | 199 | 6 | 7 | 4 | 13 | 409 | 51.13 | 71.65 |

A time series sequence plot was produced for both data sets, using the total number of visits and the total number of pages crawled.
The result for data set 1 is shown in Figure 1 and that for data set 2 in Figure 2. We also examine whether there is any correlation between the number of visits and the number of pages crawled. Karl Pearson's correlation coefficient [14] was calculated for both data sets. Data set 1 showed a strong positive correlation of 0.932, whereas data set 2 showed a moderate positive correlation of 0.505.

Figure 2: Time series sequence plot for data set 2.

Table 5: Test statistic for Case I.

| Kruskal-Wallis test | Data set 1 | Data set 2 |
|---|---|---|
| α | 0.01 | 0.01 |
| p-value | 0.0001 | 0.044 |
| Chi-square | 148.734 | 9.799 |
| df | 21 | 4 |

Table 6: Test statistic for Case II.

| Kruskal-Wallis test | Data set 1 | Data set 2 |
|---|---|---|
| α | 0.01 | 0.01 |
| p-value | 0.0001 | 0.013 |
| Chi-square | 154.85 | 12.714 |
| df | 21 | 4 |

3 Conclusion

The results point to differences in the behaviour of the web crawlers of various search engines. The more search engines access a website, the greater its visibility when users search for it. The observed results show that not all search engine crawlers visit all websites. In our experiment, data set 1 was accessed by more search engines than data set 2. Certain search engines were consistent in the number of visits and the number of pages crawled, while a few were irregular in both. Data set 1 is therefore more visible to search engine crawlers, as it is crawled by more search engines than data set 2. The results also showed a positive correlation between the number of visits and the number of pages crawled. A better search engine optimization policy can be followed to make websites visible to different search engines, so that they are listed at the top of the search engine rankings.

Acknowledgement

This research work is supported by Kerala State Council for Science Technology and Environment, Kerala State, India as per Order No.009/SRSPS/2011/CSTE.

References

[1] Kosala, R. and Blockeel, H., Web Mining Research: A Survey. ACM SIGKDD Explorations, 2(1), pp. 1-15, 2000.
[2] Mican, D. and Sitar-Taut, D., Preprocessing and Content/Navigational Pages Identification as Premises for an Extended Web Usage Mining Model Development. Informatica Economica, 13(4), pp. 168-179, 2009.
[3] Sullivan, D., Webspin: Newsletter [online], 2003. Available from: http://contentmarketingpedia.com/Marketing-Library/Search/industryNewsSeptA1.pdf. Accessed December 4, 2012.
[4] Vaughan, L. and Thelwall, M., Search Engine Coverage Bias: Evidence and Possible Causes. Information Processing and Management, 40(4), pp. 693-707, 2004.
[5] Bhagwani, J. and Hande, K., Context Disambiguation in Web Search Results Using Clustering Algorithm. International Journal of Computer Science and Communication, 2(1), pp. 119-123, 2011.
[6] Schwenke, F. and Weideman, M., The influence that JavaScript has on the visibility of a website to search engines - a pilot study. Informatics & Design Papers and Reports, 11(4), pp. 1-10, 2006.
[7] Thelwall, M., A Web Crawler Design for Data Mining. Journal of Information Science, 27(5), pp. 319-325, 2001.
[8] Drott, M., Indexing aids at corporate websites: The use of robots.txt and meta tags. Information Processing and Management, 38(2), pp. 209-219, 2002.
[9] Lee Giles, C., Sun, Y. and Councill, I. G., Measuring the Web Crawler Ethics. In: Proceedings of WWW 2010, ACM, pp. 1101-1102, 2010.
[10] Sun, Y., Zhuang, Z. and Lee Giles, C., A Large-Scale Study of Robots.txt. In: Proceedings of WWW 2007, ACM, pp. 1123-1124, 2007.
[11] Wahab, M. H. A., Mohd, M. N. H., Hanafi, H. F. and Mohsin, M. F. M., Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm. In: Proceedings of World Academy of Science Engineering and Technology, pp. 190-196, 2008.
[12] Spiliopoulou, M., Web Usage Mining for Web Site Evaluation. Communications of the ACM, 43(8), pp. 127-134, 2000.
[13] Kruskal, W. H. and Wallis, W. A., Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association, 47(260), pp. 583-621, 1952.
[14] Paneerselvam, R., Research Methodology. New Delhi, Prentice Hall of India Private Limited, 2005.
[15] Ortega, J. L. and Aguillo, I., Differences between web sessions according to the origin of their visits. Journal of Informetrics, 4, pp. 331-337, 2010.