Informatica
An International Journal of Computing and Informatics

Learning in Web Search

Guest Editors: Stephan Bloehdorn, Wray Buntine, Andreas Hotho

The Slovene Society Informatika, Ljubljana, Slovenia

EDITORIAL BOARDS, PUBLISHING COUNCIL

Informatica is a journal primarily covering the European computer science and informatics community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communication between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees from outside the author's country. In addition, it contains information about conferences, opinions, and critical examinations of existing publications, as well as news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.

Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees, or referees from the Board of Referees or the Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or new referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors.

The necessary coordination is carried out by the Executive Editors, who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Science and Technology.

Each author is guaranteed to receive the reviews of his article.
When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article.

Executive Editor - Editor in Chief
Anton P. Železnikar
Volariceva 8, Ljubljana, Slovenia
s51em@lea.hamradio.si
http://lea.hamradio.si/~s51em/

Executive Associate Editor (Contact Person)
Matjaž Gams, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 219 385
matjaz.gams@ijs.si
http://ai.ijs.si/mezi/matjaz.html

Deputy Managing Editor
Mitja Luštrek, Jožef Stefan Institute
mitja.lustrek@ijs.si

Executive Associate Editor (Technical Editor)
Drago Torkar, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 219 385
drago.torkar@ijs.si

Editorial Board
Suad Alagic (USA), Anders Ardo (Sweden), Vladimir Batagelj (Slovenia), Francesco Bergadano (Italy), Marco Botta (Italy), Pavel Brazdil (Portugal), Andrej Brodnik (Slovenia), Ivan Bruha (Canada), Wray Buntine (Finland), Hubert L. Dreyfus (USA), Jozo Dujmovic (USA), Johann Eder (Austria), Vladimir A. Fomichov (Russia), Janez Grad (Slovenia), Hiroaki Kitano (Japan), Igor Kononenko (Slovenia), Miroslav Kubat (USA), Ante Lauc (Croatia), Jadran Lenarčič (Slovenia), Huan Liu (USA), Suzana Loskovska (Macedonia), Ramon L. de Mantaras (Spain), Angelo Montanari (Italy), Pavol Návrat (Slovakia), Jerzy R. Nawrocki (Poland), Nadja Nedjah (Brazil), Franc Novak (Slovenia), Marcin Paprzycki (USA/Poland), Gert S. Pedersen (Denmark), Karl H.
Pribram (USA), Luc De Raedt (Germany), Dejan Rakovic (Serbia and Montenegro), Jean Ramaekers (Belgium), Wilhelm Rossak (Germany), Ivan Rozman (Slovenia), Sugata Sanyal (India), Walter Schempp (Germany), Johannes Schwinn (Germany), Zhongzhi Shi (China), Oliviero Stock (Italy), Robert Trappl (Austria), Terry Winograd (USA), Stefan Wrobel (Germany), Xindong Wu (USA)

Publishing Council: Tomaž Banovec, Ciril Baškovič, Andrej Jerman-Blažič, Jožko Čuk, Vladislav Rajkovič

Board of Advisors: Ivan Bratko, Marko Jagodič, Tomaž Pisanski, Stanko Strmčnik

Editors' Introduction to the Special Issue "Learning in Web Search"

Introduction

The emerging world of search is one that makes increasing use of information extraction, gradually blends in Semantic Web technology and peer-to-peer systems, and employs grid-style computing resources for information extraction and learning. This Informatica special issue explores the theory and application of machine learning in this context for the Internet, intranets, the emerging Semantic Web and peer-to-peer search. Search can also be viewed as a knowledge-sharing service on the Web, an interface to the Semantic Web. While some automation in building the Semantic Web has been achieved, it remains a labour-intensive annotation process with problems in scaling up to the full free-text Web. A partial implementation of semantic-based search is possible where hierarchical concept spaces rather than full ontologies are used, and where information extraction and learning tools in the search engine perform approximate tagging of concepts. This partial semantic-based search could be viewed as a key infrastructure for more complete Semantic Web development and, arguably, as a safety net for it.

Overview of the issue

The articles in this special issue originate from two different backgrounds. Two articles are from the EU IST project ALVIS (http://www.alvis.info/), which aims to bring Web search infrastructure closer to the vision of the Semantic Web by automating some of the labour-intensive annotation processes. The article "Semantic Search in Tabular Structures" by Aleksander Pivk, Matjaž Gams and Mitja Luštrek explores techniques for making tables and their content the subject of search, while also considering the implicit semantics of the tabular structure. In the contribution "Beyond term indexing: A P2P framework for Web information retrieval", the authors Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm and Karl Aberer present a new framework for full-text information retrieval in P2P overlay networks and introduce a novel retrieval model based on highly discriminative keys. Two further papers are selected papers from the workshop "Learning in Web Search", held at the International Conference on Machine Learning (ICML) in 2006 and organized by the editors of this issue. The article "A Semantic Kernel to classify Texts with very few Training Examples" by Roberto Basili, Marco Cammisa and Alessandro Moschitti contributes to the field of using semantic background knowledge in the context of kernel methods. The article "Sailing the Web with Captain Nemo: a Personalized Metasearch Engine" by Stefanos Souldatos, Theodore Dalamagas and Timos Sellis presents the implementation of a metasearch engine that exploits personal user search spaces.

We thank the authors, the reviewers and the Informatica editors for their efforts to ensure the quality of the accepted papers and to make the reading as well as the editing of this special issue a rewarding activity.

Stephan Bloehdorn, Wray Buntine and Andreas Hotho

Semantic Search in Tabular Structures

Aleksander Pivk1, Matjaž Gams1 and Mitja Luštrek1,2
1 Department of Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
2 Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
E-mail: {aleksander.pivk, matjaz.gams, mitja.lustrek}@ijs.si

Keywords: tabular structures, ontology learning, semantic web, query answering

Received: November 18, 2005

Semantic Web search aims to overcome the bottleneck of finding relevant information by using formal knowledge models, e.g. ontologies. The focus of this paper is to extend a typical search engine with semantic search over tabular structures. We categorize HTML documents into topics and genres. Using the TARTAR system, tabular structures in the documents are then automatically transformed into ontologies and annotated to build a knowledge base. When posing queries, users receive responses not just as lists of links and description extracts, but also enhanced with replies in the form of detailed structured data.

Povzetek: We have developed Semantic Web methods for finding information in tables.

1 Introduction

The World Wide Web has, in the years of its exponential growth, become a universal repository of human knowledge and culture, enabling an exchange of ideas and information on a global level. The tremendous success of the Internet is based on its simplicity of use, its efficiency and its enormous market potential [4]. This success is countervailed by the effort needed to search for and find relevant information. Searching for interesting information has turned out to be a difficult, time-consuming task, especially due to the size, poor structure and lack of organization of the Internet [3, 7, 31].
A number of approaches have appeared in the last decade with the common objective of improving the searching and gathering of information found on the Web. One of the first solutions to cope with the information overload was the search engine. In order for a search engine to function properly and to return relevant and satisfactory answers to user queries, it must conduct two important tasks in advance. The first task is to crawl the Web and, following the hyperlinks, gather as many documents as possible. The second task deals with document indexing [12, 35], hyperlink analysis, document relevance ranking [20, 23] and high-dimensional similarity searches [13, 19]. Once these tasks are completed, a user may pose queries to obtain answers.

User queries, unless they include highly selective keywords, tend to match a large number of documents, because they do not contain enough information to pinpoint the most highly relevant resources [28]. They may sometimes even miss the most relevant responses because of the absence of direct keyword matches, but the most common disadvantage of this approach is that an engine might return thousands of potentially interesting links for a user to explore manually. The study described in [17] showed that the number of keywords in a query is typically smaller than three, which clearly cannot sufficiently narrow down the search space. In principle, this can be seen as a problem of the users of search engines themselves. The ambiguity of search requests can be reduced by adding semantics, i.e. by making computers better 'understand' both the content of Web pages and the users' intentions, which can help to improve search results even for a request of a very limited size. In addition, search engines favour the largest information providers due to name-branding and time optimization.
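The indexing-and-matching step described above, and its keyword-miss failure mode, can be sketched with a toy inverted index. This is a minimal illustration only; the corpus, document identifiers and function name are invented for the example and are not taken from any system cited in the text:

```python
from collections import defaultdict

# Toy corpus standing in for crawled Web documents (illustrative data).
docs = {
    "d1": "semantic web search over tabular structures",
    "d2": "machine learning for web search engines",
    "d3": "peer to peer information retrieval",
}

# Indexing: map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return the documents matching ALL query keywords (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(sorted(search("web search")))  # -> ['d1', 'd2']
print(sorted(search("internet")))    # -> [] despite d2 being topically relevant
```

The second query illustrates the point made above: a purely lexical match returns nothing for "internet" even though document d2 is about Web search engines, because relevance here depends entirely on direct keyword overlap rather than on any semantic understanding of the request.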
To overcome this bottleneck, Semantic Web search, in which information is structured in a machine-interpretable way, is a natural step forward, as originally envisioned by Tim Berners-Lee [4]. Moving to a Semantic Web, however, requires the semantic annotation of Web documents, which in turn crucially depends on some sort of automatic support to facilitate this task. Most information on the Web is presented in semi-structured and unstructured documents, i.e. loosely structured natural language text encoded in HTML, and only a small portion consists of structured documents [2, 12]. A semi-structured document is a mixture of natural language text and templates [2, 12]. The lack of metadata that would precisely annotate the structure and semantics of documents, together with the ambiguity of the natural language in which these documents are written, makes automatic computer processing very complex [9, 21]. Tabular structures (i.e. tables or lists) are incorporated into semi-structured documents; they may take many different forms and can differ substantially even when they represent the same content or data [14, 16]. Here we consider tabular structures to be tables and lists described by a particular tag in HTML, i.e. the <table> tag represents a table, where