Volume 25, Number 3, October 2001, ISSN 0350-5596
Informatica: An International Journal of Computing and Informatics
Special Issue: Knowledge Based Software Engineering; Information Technology
Guest Editors: Pavol Navrat; Cene Bavec, Matjaž Gams
The Slovene Society Informatika, Ljubljana, Slovenia

Knowledge based software engineering (Introduction to special issue)
Special Issue Editor: Pavol Navrat
Slovak University of Technology, Department of Computer Science and Engineering, Ilkovičova 3, 812 19 Bratislava, Slovakia
navrat@elf.stuba.sk, http://www.elf.stuba.sk/~navrat/

Recent advances in the field of Software Engineering show that the field is not only gradually maturing but that some important developments are under way. This special issue is intended to give the readers of Informatica a focused view of some very interesting results in an area that is frequently called Knowledge Based Software Engineering (KBSE).
Research in this area aims at advancing Software Engineering towards more automated methods by incorporating explicitly represented knowledge from the problem domain as well as from the solution domain, i.e. design and programming knowledge. Research also takes inspiration from Artificial Intelligence techniques in a wider sense. The area of Knowledge Based Software Engineering has, despite its young age, a noteworthy history of development. It is standard practice that researchers in an emerging area first gather at scientific workshops and conferences. There have been at least two series of international conferences devoted to Knowledge Based Software Engineering. In 1983, Rome Air Development Center published a report calling for the development of a Knowledge-Based Software Engineering Assistant, which would employ artificial intelligence techniques to support all phases of the software development process (this report appears in [1]). This initiated a series of regular conferences. In 1991, the conference was named the Knowledge-Based Software Engineering Conference, KBSE-6. After KBSE-11 in 1996, the decision was made to change the name to the Automated Software Engineering Conference. The other series is the Joint Conference on Knowledge Based Software Engineering, which started in 1994 as a joint effort primarily of scientists from Japan and Eastern Europe. The conference is alive and well, taking place biennially as a truly international gathering with participants from all over the world. Since 1998, when it was held in Slovakia, rigorously refereed and selected papers submitted to the conference have been published in a monograph by IOS Press [2] [3]. The next conference will be held in Maribor in 2002 [4]. Such conferences are but one of several forums of communication. There are several international scientific journals where papers from this area can be found. However, this special issue of Informatica is one of the first devoted to Knowledge Based Software Engineering (another one to be named is the special issue of the IEICE Transactions on Information and Systems from 2000, based on papers from the 1998 JCKBSE Conference [5]). It is an extraordinary honour for me that the Editorship of Informatica has entrusted me with editing this special issue, for which I am very thankful. I have taken special care in preparing this issue. Papers were solicited in a regular process that started with a call for papers. Thus, this special issue is not a collection of papers presented at some conference. All the submissions were prepared originally for this special issue. Here are the main topics from the call for papers, showing roughly what is understood under the heading Knowledge Based Software Engineering:
• software architecture and component reuse
• agents on internet and intranet
• domain modelling
• semiformal and formal specifications
• design patterns
• post-object-oriented programming paradigms, including agent-oriented, aspect-oriented, generative programming, and multiparadigm approaches
• knowledge discovery
• program understanding and learning.
The response from prospective authors has been quite encouraging. I received about a dozen submissions. After a preliminary review, 9 submissions were subjected to a further rigorous refereeing process. The work of all the assigned reviewers has been invaluable for a successful completion of the process. Their carefully elaborated reviews were essential in deciding on acceptance and rejection.
Finally, I selected 4 papers for the special issue, which amounts to an acceptance rate well below 50%. I wish to emphasise this fact, which illustrates the high standard of quality of Informatica. I wish to thank all the authors who submitted to the special issue. Many interesting ideas could be found in all the submissions. Even some of the submissions that ultimately had to be rejected were very good papers. The papers were accepted solely on the basis of their scientific merit. However, as it turned out, their themes spread nicely across a wider spectrum of topics, so the readers get an opportunity to acquaint themselves with several interesting problems and approaches to solving them. One of the central themes in current Knowledge Based Software Engineering research is to advance the understanding of design patterns, frameworks, and applications. All these concepts can also be viewed as various kinds of knowledge. Knowledge Engineering offers a different conceptual framework to view and to deal with such knowledge. But there are more common concepts that can be found in both software systems and intelligent systems: let us name just intelligent software agents, and components. In the collection of papers, the first one, by Soundarajan and Fridella, is titled Understanding OO Frameworks and Applications: An Incremental Approach. When reusing the design built into a framework, application developers need to understand the behaviour of such frameworks. The authors propose techniques that allow the framework developer to specify the framework behaviour precisely and allow the application developer to make use of it when building their application from the framework. The second paper, Towards Software Design Automation with Patterns by Sikici and Topaloglu, focuses on a software development process based on applying design patterns. A new approach to describing patterns, viewed as pieces of knowledge, is presented. Patterns, as represented by their descriptions, can be combined or abstracted in the development process. The next paper, Support of Knowledge Management in Distributed Environment by Paralič, Paralič and Mach, deals with a slightly different topic: knowledge management within an organisation. Their special focus is on organisations with distributed environments. Their approach makes use of mobile agents and distributed programming. The resulting system supports a distributed organisational memory as a way of modelling domain knowledge. The fourth paper, by Lee and Tsai, is A Novel Agent to Efficiently Extract the Top K Interesting Target Ranks from Web Search Engines. They tackle a rather specific problem, which however falls within the very important area of efficient Web searching. The paper also shows how the research focus in Software Engineering is shifting towards problems connected with Internet applications. Finally, I should like to express my hope that readers will find the special issue interesting and inspiring. My thanks go to the following reviewers: Pavel Brazdil, Mark Nissen, Davide Bugali, Mel Ó Cinnéide, Peter Dolog, Amnon Eden, Ivan Kapustik, Ann Macintosh, Jerzy Nawrocki, Roumen Nikolov, Jens Penberg, Gabika Polcicovà, Wilhelm Rossak, Zhongshi Shi, Maria Smolàrovà, Gheorge Tecuci, Valentino Vranić, Stefan Wrobel. I wish to thank Professor Matjaž Gams for his perfect support in my work.

References
[1] Readings in Artificial Intelligence and Software Engineering, Morgan Kaufmann, 1986.
[2] Navrat P. & Ueno H. (Eds.) (1998): Knowledge-Based Software Engineering.
Frontiers in Artificial Intelligence and Applications, Vol. 48, IOS Press, Amsterdam, 334 pp.
[3] Hruška T. & Hashimoto M. (Eds.) (2000): Knowledge-Based Software Engineering. Frontiers in Artificial Intelligence and Applications, Vol. 62, IOS Press, Amsterdam, 342 pp.
[4] Joint Conference on Knowledge Based Software Engineering 2002: http://kbse.cs.shinshu-u.ac.jp/conf/conf.html
[5] Special Issue on Knowledge-Based Software Engineering, IEICE Transactions on Information and Systems, Vol. 83-D, No. 4, 2000.

Understanding OO frameworks and applications: an incremental approach
Neelam Soundarajan and Stephen Fridella
Computer and Information Science, Ohio State University, Columbus, OH 43210
e-mail: {neelam,fridella}@cis.ohio-state.edu
Keywords: Object-oriented application frameworks, software reuse, formal specifications, verification
Received: March 18, 2001

Object-Oriented frameworks provide a powerful technique for developing groups of related applications. Individual designers may develop applications tailored to their specific needs by providing appropriate definitions for the hook methods, while reusing the design built into the framework. One important problem in using this approach is the lack of suitable techniques for precisely documenting the behavior of such frameworks. If such techniques were available and were used to document framework behavior, application developers would be able to understand the framework more precisely, build their applications from the framework more effectively, and reason more reliably about the behavior of these applications. Our goal is to develop such techniques. The techniques we develop will allow the framework designer to specify the framework behavior precisely, and allow the application developer to 'plug in' information about the behavior of the hook methods into the specification of the framework to arrive at the behavior of the application. We illustrate the technique by applying it to a simple case study.

1 Introduction

Object-Oriented (OO) frameworks [8, 16] promise to dramatically reduce the time and effort needed to develop complete applications. The framework designer designs the main classes and methods, including key methods that direct the flow of control and call appropriate methods of various classes. These key methods are referred to as template methods in the design patterns literature [10], the methods they call being the hook methods. In order to implement a complete application based on such a framework, the application developer treats (some of) the classes of the framework as base classes and defines derived classes that include definitions for the hook methods that implement behavior appropriate to his¹ particular application. A different developer could produce a (possibly very) different application by providing an alternate set of definitions for the hook methods. In effect, the framework designer implements behavior that is common to the various applications that may be built on this framework, including such behavior as the patterns of calls implemented in the template methods, and the individual application developer enriches this common behavior in ways appropriate to his particular application by defining (or redefining) the appropriate hook methods. If the framework has been designed carefully and provides the right hooks, an entire new application can be developed with just the incremental effort involved in designing a new set of definitions for the hook methods.

¹ Following standard practice, we use 'he', 'his', etc. in place of 'he or she', 'his or her', etc.
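To make these roles concrete, here is a minimal, hypothetical sketch (the paper itself presents no code at this point; the class and method names below are our own, loosely anticipating the banking case study of Section 5): a refineable class supplies the hook methods, and the controller's template method fixes the pattern of calls.

    // Hypothetical refineable class: its methods are the hooks ("virtual
    // methods") that an application may redefine.
    class Account {
        protected int balance = 0;

        // Hook methods with default, framework-level behaviour.
        public void deposit(int amount)  { balance += amount; }
        public void withdraw(int amount) { balance -= amount; }
        public String getInfo()          { return "balance " + balance; }
    }

    // Controller class of the framework: run() is the template method that
    // fixes the flow of control and calls the hooks.
    class Bank {
        private final Account acct;
        public Bank(Account acct) { this.acct = acct; }

        public String run(int[] deposits, int[] withdrawals) {
            for (int d : deposits)    acct.deposit(d);    // hook calls: dispatched to the
            for (int w : withdrawals) acct.withdraw(w);   // application's redefinitions, if any
            return acct.getInfo();
        }
    }

An application developer would derive a class from Account, redefine some of the hooks, and hand an instance to Bank; run() itself is then reused unchanged.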
But it is not sufficient to be able to build these applications; we also need to be able to reason about them so that we can understand and document their behavior. And the reasoning technique should itself be incremental. In other words, we should be able to reason just once about a given framework and arrive at a suitable specification for it; then, when a new application is built on this framework, we should be able to 'plug in' appropriate information about the (behavior of the) hook methods defined in the application to arrive at a suitable specification of the behavior of the entire application. Unless the reasoning technique is incremental in this sense, the framework approach will be little more than a mechanism for code reuse. Our main goals in this paper are to develop an incremental reasoning technique for frameworks and applications, and to demonstrate it on a simple case study. In our approach to documenting frameworks, we will specify the behavior of each method of each class in terms of a pre-condition and a post-condition [21]. The key problem we have to address is that of specifying the behavior of the template methods of the framework in a way that suits the needs of the application developer. Suppose, for example, that t() is a template method of the framework. We will see the full details later, but the key is to include, in the specification of t(), information about which hook methods t() invokes, the objects it invokes them on, the order it invokes the methods in, etc. In order to provide this type of information, we will make use of a trace, denoted by the symbol τ, to record the hook method calls that the template method makes. Our traces are somewhat similar to the traces used in reasoning about distributed systems [22], except that whereas we use them to record information about hook method calls, in dealing with distributed systems they are used to record information about communications between the processes. The specification of t(), in particular its post-condition, will give us information not only about the final values of the member variables but also about which hook methods t() invoked along the way, on what objects, etc. As we will see, the application developer will then be able to plug into this specification the behaviors corresponding to his particular (re-)definitions of the hook methods, to arrive at the corresponding new behavior of t(). There is one key requirement that hook method (re-)definitions must satisfy. Suppose X.h() is one of the hook method invocations in t(), X being the object and h() the method invoked. In arriving at the specification of t() by analyzing its code in the framework, the (framework) designer would have made some assumptions about the effects of this call to h(). Typically, these would correspond to the behavior exhibited by the default definition of h() provided as part of the framework, or, if no default definition is provided, then the behavior that is expected to be common to the different definitions that will be provided in the different applications built on this framework. This default (or expected) behavior will be part of the specification of the framework. Unless the (re-)definition of h() in the application satisfies this specification, the analysis of t() may no longer be valid, and we would be forced to reanalyze the body of t().
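As a rough illustration of the role a trace plays (our own sketch, not the authors' formalism: in the paper the trace is a specification-level construct used in reasoning, not something the framework actually executes), one can picture the template method of the hypothetical Bank sketch above as if it logged every hook call it makes:

    import java.util.ArrayList;
    import java.util.List;

    // One entry per hook-method call made by the template method: which object
    // was involved, which hook was invoked, and with what argument. (The trace
    // elements defined formally in Section 3 also record the object's state at
    // the call and at the return.)
    record Call(String object, String hook, int arg) {}

    class TracedBank {
        final List<Call> trace = new ArrayList<>();   // the trace of hook calls, initially empty
        private final Account acct;                   // Account as in the earlier sketch
        TracedBank(Account acct) { this.acct = acct; }

        void run(int[] deposits, int[] withdrawals) {
            for (int d : deposits) {
                acct.deposit(d);                      // dispatched to the application's redefinition
                trace.add(new Call("acct", "deposit", d));
            }
            for (int w : withdrawals) {
                acct.withdraw(w);
                trace.add(new Call("acct", "withdraw", w));
            }
        }
    }

The post-condition of run() can then talk about this sequence — which hooks were called, on which objects, in which order — and the application developer can substitute the behaviour of his own redefinitions for each recorded call. Of course, this only helps if each redefined hook actually behaves as the framework-level analysis assumed it would, which is the requirement discussed next.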
Since we want to avoid such re-analysis of the framework code, we will require that the redefinition of h() in the application satisfy its specification in the framework. Such a requirement is not new to our work; it is the key idea underlying the work on behavioral subtyping [1, 19, 20]. What is new about our work is that, if this requirement is satisfied for each of the hook methods, then the "plugging-in" process we outlined above will allow us to arrive at the richer behavior that t() acquires as a result of the (re-)definition of h(). Since the entire notion of OO frameworks rests on this ability to enrich, in the application, the common behavior provided by the framework, it is essential that the reasoning system enable us to reason about this enriched behavior.

1.1 Related Work

The importance of proper documentation of frameworks has been widely recognized and a number of approaches [2, 5, 9, 11, 13, 15, 18, 23, 24, 28] have been developed for the purpose. Some of these, for example [2, 5, 18, 23, 24], are informal and focus on training application developers to use the framework in ways intended by the framework designer, using cookbook-style patterns. Others are more formal and focus on developing techniques for specifying and reasoning about the behavior of the framework and applications built using the framework. Since our goal is to develop a formal reasoning method, we will restrict attention to this latter group. One important aspect of our work that distinguishes it from other approaches is that we are interested not just in ensuring that the application meets the constraints imposed on it by the framework but also in arriving at the richer behavior resulting from the definitions (of the hook methods) adopted in the application, whereas most of the others focus on just the first point, i.e., ensuring that the application meets the framework's constraints. In their work on documentation of frameworks, Campbell and Islam [4] consider what might be called structural questions, such as identifying which objects are related to which other objects, specifying constraints on which classes may be combined with which other classes, etc., whereas our focus is on behavioral questions. But for complete documentation of complex frameworks and applications, we must deal with both structural and behavioral questions, so our approach is in some sense complementary to that of [4]. Helm et al. [13] introduce the notion of contracts to capture important aspects of the interactions between objects in a system. A contract imposes conditions on what calls to certain virtual² methods must do and under what conditions, whereas we will not impose such constraints. We will, as already outlined, require hook-method redefinitions to satisfy their framework-level specifications, but these are functional requirements, whereas contracts allow one to impose requirements on, for example, what other methods the called hook method may, in turn, call. But [13] does not deal with the question of how to combine the specific behaviors implemented by the hook methods as defined in the application with the specification of the framework to obtain the behavior of the complete application. D'Souza's [7] Catalysis method focuses on the development of components and frameworks. It uses pre- and post-conditions as well as statecharts to specify transitions between key (abstract) states; external calls, including calls to hook methods, can be encoded in these statecharts.
Although Catalysis provides a rich set of notations and tools to express behavior, it is not clear how it deals with the question of ensuring that the application developer abides by the framework's requirements, nor how to plug in the extra behavior provided by the application to arrive at the application-specific behavior. Froehlich et al. [9] use the term hook to denote some aspect of the framework that can be used to tailor the system to the application's needs. A hook method is a simple hook, but it is possible to have more complex hooks as well. Suppose, for example, that two hook methods are intended to cooperate in some manner; if one of them is redefined in the application then the other one would also (very likely) have to be redefined in order to be true to the intent of the framework designer; these two methods will together be considered a hook. Froehlich et al. propose a formal syntactic notation for specifying such hooks but do not deal formally with questions of behavior. Büchi and Weck's [3] work is closer to our approach. They note that pre- and post-conditions on just the values of member variables are inadequate when dealing with template methods. They introduce a formalism and a programming-language-like notation using which some information about the trace of hook method calls can be specified. Although their use of traces is similar to ours, Büchi and Weck focus only on specifying conditions that the trace must satisfy. They do not address the question of establishing the richer behavior resulting from the redefinitions of the hook methods in the application.

² For concreteness, we occasionally use language-specific terminology although our reasoning technique applies to any class-based OO language; thus 'virtual' is C++ terminology; in Eiffel or Java, we would say 'non-final'.

Garlan et al. [12] develop a temporal-logic based approach to reasoning about systems involving implicit invocation. Calls to hook methods are essentially implicit invocations, since the actual method invoked is defined in the derived class, which may not even be implemented when the framework is designed. Conversely, event-based systems, the type of implicit invocation systems that [12] considers, can be rewritten using template methods and hook methods. Indeed, the approach of [12] requires us to reason about the dispatcher method, which is responsible for handling the implicit invocations, and this method is analogous to the standard template methods found in OO frameworks. One difference with our approach is that our specifications are non-temporal and hence somewhat easier to work with; on the other hand, our specifications involve traces explicitly whereas those of [12] do not. A more important difference is that a key goal of our approach is to first reason about the framework independently of any application that may be built on it and then, when an application is built by (re-)defining the virtual methods of the various base classes of the framework, to arrive at the behavior of the entire application by plugging information about the methods defined in the application into the specification of the framework. By contrast, the approach of Garlan et al.
is more of a top-down approach, so to speak; that is, given a system S (consisting of all the methods in all the classes) and a specification for it, they partition S into a number of groups (each consisting of a number of methods), arrive at a suitable specification for each group, show that each group satisfies its specification, and show that together the specifications of the individual groups imply the original specification of S. For frameworks, since many applications may be developed on a given framework, our approach of reasoning independently about the framework and then plugging in application-specific information when the application is built would seem preferable. In an earlier paper [26], we considered a related problem. Our goal in that paper was to reason about a single base class and the derived classes that may be defined from it. Although that problem is related to the problem of reasoning about frameworks and applications, there are also important differences. For example, in the case of frameworks, there are several different objects, each with its own static type and its own dynamic type; by contrast, in the situation considered in [26], there is a single object whose static type is the base class in question and whose dynamic type is the particular derived class under consideration. Again, in the case of frameworks, the result of redefinitions of virtual methods in a derived class in the application is to enrich not only the other methods of the corresponding base class but also, and indeed more importantly, the methods of the controller class of the framework. By contrast, in [26] the only enrichment resulting from redefinitions of virtual methods in a derived class is of other methods of the base class that invoke these virtual methods. Despite these differences, we were able, as in the current paper, to use a trace-based approach to address the problem considered in [26]. This should not be surprising; traces are a fairly powerful tool and have been used for reasoning about different types of systems ranging from sequential to concurrent to distributed and real-time systems. Indeed, a trace-based model also underlies the approach of Garlan et al., although traces do not explicitly appear in their reasoning technique. But "there is no free lunch", as the saying goes; for exploiting the power of traces, we pay a price in the form of somewhat more complex assertions in our specifications, as we will see in the case study. This complexity can be somewhat mitigated by defining suitable functions and predicates over traces for use in our specifications. We will return to this issue in the final section of the paper. The main contributions of this paper may be summarized as follows:
- It identifies the problems involved in documenting precisely the behavior of frameworks and in reasoning about the behavior of applications built on frameworks.
- It develops a reasoning technique to allow the framework designer to specify the framework, and to allow the application developer to incrementally arrive at the specification of his application from that of the framework.
- It illustrates the reasoning technique by applying it to a simple case study.
The rest of the paper is organized as follows: In Section 2 we present a simple model of frameworks and applications. In Sections 3 and 4 we develop our approach to specifying and reasoning about the behavior of frameworks and applications. In Section 5 we apply the approach to a simple application.
In the final section we reiterate the importance of precise reasoning about framework behavior, summarize our solution to the problem, and identify some directions for future work.

2 Model Of Frameworks

We will say that a class A is refineable if at least one of its methods is virtual³. A class all of whose methods are non-virtual is a concrete class. A framework F will consist of the following classes:
- A concrete class C, which we will also call the Controller class of F. C will have a distinguished method named run(). As the name suggests, it is this method that primarily decides how control flows among the various methods of the various classes of F.
- Zero or more other concrete classes C1, ..., Cm.
- One or more refineable classes A1, ..., An.

³ Our refineable classes are very similar to abstract classes, the key difference being that at least one of the methods of an abstract class must be pure virtual (deferred in Eiffel), i.e., have no body defined for it in the class. In this case, the derived class would have to provide a definition for this method, else objects of this type cannot be created. Our approach deals with virtual and pure-virtual methods in very similar ways, so we will generally not distinguish between the two. One language-specific point to note is that in C++, if a method is virtual (or pure virtual) in a given class, it remains virtual in derived classes; our intention is that all virtual methods will become, in Java terminology, finalized in the applications. We should also note that we are taking a bit of liberty with the language in using 'refineable' to mean 'capable of refinement'.

In addition to the run() method, many frameworks include a mechanism for initialization in order to allow some information about the actual application to be passed to the framework. Although details of the initialization are important in practice, in this paper we will ignore them so that we can focus our attention on reasoning about the run() method, the hook methods (of the refineable classes A1, ..., An) that it invokes, and the possibilities that these provide to the application developer to enrich the behavior provided by the framework. An application A developed using F will have, corresponding to each refineable class Aj of F, one or more derived classes CAj,i, each of which will provide definitions for some or all of the virtual methods⁴. AC will denote this additional code provided by the application builder. In order to use the application A, we would (in the 'client program' using A) create an object, let us call it d, of type C, initialize it, and then invoke run(): d.run(). Objects such as d will have components of type C1, ..., Cm as well as A1, ..., An (or rather CA1,1, ...); the client program itself will not have any objects of these types⁵. Inside the run() method we will call the various methods defined in C1, ..., Cm and A1, ..., An, some of which will be virtual methods and hence will invoke the corresponding methods defined in the appropriate concrete classes defined in AC, to act upon these components of the main object d. In this manner, the application will exploit the common framework design embedded in the run() method, as well as in any other methods defined in F, while at the same time exploiting the application-specific behavior defined in AC.

⁴ We should note that CAj,i, for example, might be the same as Aj (if Aj does not contain any pure virtual methods); for the sake of uniformity of notation, we will still refer to it as CAj,i in the application.

⁵ This seems to be the standard pattern for using applications built on frameworks. In view of this, it may be useful for programming languages to provide syntactic mechanisms that 'bundle' the classes of F into a syntactic unit, identify the controller class C of F, etc. That will make it clear that the 'client' program is only expected to have objects of type C, and that classes C1, ..., Cm etc. are of no direct relevance to the client.

We conclude this section with an observation. The main contribution that a framework makes is relieving the application developer of having to worry about how control should flow among the various methods of the various classes. In our model, this is highlighted by the run() method. None of the other methods of Ai is allowed to invoke the hook methods, because if they were allowed to do so, that would amount to embedding some of the control flow logic into these other methods. Not every framework fits within our model. For example, it is possible that the control flow is in fact distributed among several methods, perhaps even methods of several classes. Nevertheless, our model captures the essential aspects of frameworks. The key ideas underlying the reasoning technique we develop in the next two sections, focused as it is on characterizing the control flow in the framework, apply also to frameworks that do not fit exactly within our model. There is another, somewhat technical, restriction we impose on our frameworks: all the objects in our frameworks are named, i.e., there are no anonymous objects. In practice, frameworks often do have anonymous objects that are accessed via pointers or other references to them. In the final section we will briefly consider how we can use our approach to reason about such frameworks (and applications).

3 Reasoning About The Framework

In order to specify the framework, we need to provide specifications for each of the concrete classes C1, ..., Cm, each of the refineable classes A1, ..., An, and the controller class C. Since our approach is an extension of the standard approach to reasoning about classes (see for example [17]), we start with a brief summary of that approach: In reasoning about a class B, we first write down the specification of each method of B as pre- and post-conditions over the concrete state, i.e., the values of the member variables, of B; check (or formally verify) that the method body satisfies its concrete specification; define a conceptual model of B; write down the abstract specification of each method as pre- and post-conditions over the conceptual state; define a function (or relation) mapping the concrete state to the conceptual state; and finally check that if a method satisfies its concrete specification, then it must also, given our mapping from the concrete to the conceptual state, satisfy its abstract specification. Once this is done, the concrete specification is essentially forgotten, since data abstraction, a basic tenet of the OO approach, requires the client of B to take an abstract view of the class and not worry about its internal details. However, for a designer of a derived class D of B, the concrete specification of B is important, since the methods he (re-)defines in D will in general access the member variables of the base class B and thereby interact with the methods defined in B, so it is important for this designer to know how these methods manipulate the member variables⁶ of B.
⁶ Some authors, see for example [14], have argued that the derived class should have no direct access to the member variables of the base class but that, instead, the base class should provide a special set of (protected) operations that would be available to the derived class but not to client code. These protected operations would permit the methods in the derived class to manipulate the variables of the base class, but only in ways allowed by the protected operations. Even with this approach, we would still need to utilize two types of specifications: the first, an abstract specification, for use by clients of the (base) class; the second, a (more) concrete specification that includes information about the behavior of the protected operations, for use by the derived class designers.

Consider the concrete class Ci. The framework is a client of Ci. Hence, all we need is the conceptual specification of each method of Ci. This specification is not for use by the final client of the application, since the class Ci is for use (only) by F. Rather, when dealing with a call of the form ci.f() in the body of run(), ci being a member variable of type Ci (of the controller class) of the framework and f() one of the methods of Ci, we will appeal to the conceptual specification of f() to understand the effect of this call. The situation is somewhat different for the refineable class Ai, since the application developer will define one or more derived classes of Ai in which one or more of the hook methods of Ai will be (re-)defined. This means, following the comments in the last paragraph, that we need the concrete specification of the behavior of the methods of Ai. If ai is a member of the controller class C, and the static type of ai is Ai, then in dealing with a call such as ai.h() in the run() method, h() being a method of Ai, we will appeal to the concrete specification of h().

Most interesting is the controller class C, in particular its run() method. As noted earlier, it is this method that makes it possible to exploit the enriched behavior that the application developer implements in the application. For example, if the developer redefines the virtual method h() in CAi,j and if the dynamic type of the object that ai refers to is CAi,j, then the call ai.h() in run() will be dispatched to this redefined method, CAi,j.h(). In order to allow the developer to reason about the resulting enriched behavior, we need to record information about this call in the specification of run(). For this purpose, we associate a trace τ with run()⁷ and record all calls it makes to hook methods such as h(). What information about the call ai.h() do we need to include in the trace τ and how do we record it? This call (and the corresponding return) will be recorded by appending the following element to τ:

  ⟨ai, h, x, ai.v, xx, vv⟩   (1)

The first component identifies the object involved, here ai. The second is the hook method called, here h(). The third gives us the values of the arguments passed to h(). The fourth component is the state, i.e., the values of the member variables, of the object (ai) at the time of the call. The next component is the values of the arguments when h() returns⁸. The final component is the state of the object when h() returns.

⁷ Recall that in our model, none of the other methods of Ai may invoke virtual methods such as h() of Ai. If this restriction were relaxed, the redefinition of h() in CAi,j would result in this other method of Ai also being enriched, since the call to h() in this method would also be dispatched to the redefined method. To deal with such enrichments, we would have to associate a trace also with any such method of Ai and record all such calls to virtual methods.

⁸ We assume, for simplicity, that all arguments are of simple types such as int and are passed by value-result.

Let us consider each of these components in turn and see why it needs to be included. The first component, the identity of the object involved, is needed since otherwise we would not know which component of the framework will be enriched by the redefinition of h() in the application. The second component is needed since, without knowing which method was invoked, we cannot even know whether any enrichment is involved. The next two components are needed since the enrichment in question may depend on these values. For example, in the banking framework considered in Section 5, one of the virtual methods of Account, one of the refineable classes of the framework, is withdraw(); in an application built on this framework, we may redefine withdraw() to keep a count of withdrawals of more than a specified amount; if we want to be able to reason about the resulting enrichment of the run() method, it is not sufficient to know which account objects withdraw() was applied to; we also need to know the amount withdrawn, in other words the argument value. Similarly, if the enrichment depends on the state of the object, say the current balance in the account, we will need the information in the fourth component to reason about the resulting enrichment of run()'s behavior. The information in the last two components is needed in some unusual cases. In particular, if the framework-level specification of h() does not prescribe specific values (depending upon the values of x and ai.v) to be assigned to the various member variables or the arguments, but instead allows the method to assign a range of values, and if the subsequent behavior of run() depends on these values, then in order to express the relation between this behavior and the state of the object (or the values of the arguments) at the time of the return from the initial virtual method call, we would need to refer to this state in the specification.

The rule R1 below is a standard procedure call rule, modified to take account of the recording of the call in the trace τ:

R1. Hook Method Call Rule

  p ⇒ pre.Ai.h()[y ← x, v ← ai.v]
  ∀xx, vv. { (p ∧ post.Ai.h()[y@pre ← x, v@pre ← ai.v, y ← xx, v ← vv])
             ⇒ ((q[ai.v ← vv, x ← xx])[τ ← τ ^ ⟨ai, h, x, ai.v, xx, vv⟩]) }
  ------------------------------------------------------------------
  { p }  ai.h(x)  { q }

The first antecedent requires us to show that the pre-condition of h() of the class Ai is satisfied, with x playing the role of the formal parameter y and the state ai.v of the involved object ai playing the role of the member variables v of the class. '←' denotes substitution of the indicated variables. In the second antecedent, the '@pre' notation [27] appearing in the post-condition of a method refers to the value, at the time the method was called, of the variable in question. Being able, in the post-condition, to refer to the values of variables at the time the method started often simplifies the form of the post-conditions. '^' denotes appending an element to the trace.
The second antecedent may be understood as follows: As far as the framework code is concerned, the effect of the call to h() is essentially to assign new values to τ (by appending to τ an element representing the call), to ai.v (because of the new values that will in general be assigned by the body of h() to the member variables of ai), and to the arguments x (since our parameters are passed by value-result). This may be represented as three assignments; the first of these is:

  τ := τ ^ ⟨ai, h, x, ai.v, xx, vv⟩   (2)

which appends, to τ, an element representing the call; xx, vv represent respectively the final values of the arguments and the final state of ai when the method finishes execution. This is followed by the two assignments:

  ai.v := vv ;  x := xx   (3)

representing respectively the effect of the call on the state of ai and on the arguments x. Thus the second antecedent of R1 checks that the pre-condition p and the post-condition q of this call are appropriately related, the substitutions to q reflecting the effect of the three assignments. The quantifiers on xx, vv, and the post-condition (with appropriate substitutions) on the left side of the implication of this antecedent, ensure that we consider all possible values (constrained by the requirement that they satisfy the conditions imposed by post.Ai.h()) that the call may leave in the arguments and in the member variables of ai. It is worth noting that although R1 allows us to include complete information about each component of each element of τ in the specification of run(), doing so may often result in unwieldy specifications. In practice, the framework designer must cull this information and include only those aspects that he expects will be of use to the application developer in building actual applications. In doing so, the designer does run the risk of leaving out some information that a particular application developer could have profitably exploited. We will return to this point later in the paper.

4 Plugging In The Application

When an application is built on a framework, the most common type of enrichment is to add variables to one of the refineable classes Ai and redefine its virtual methods to update the new variables when these methods are invoked. For example, in our banking framework considered in the next section, we will enrich the Account class of the framework by adding a member variable tCount which will keep a count of the transactions performed on the account; correspondingly, the withdraw() and deposit() operations will be redefined so that they not only update the balance in the account but also increment tCount when they are invoked. The first step in reasoning about the application is to write down (and, if necessary, verify) the new specifications for the redefined methods. The details of this will, of course, depend on the details of the redefinitions. Specifications are also written down for the inherited methods; these are identical to their framework-level specifications, with an additional clause in the post-condition that asserts that the values of the new member variables are the same as their values when the method started, i.e., a clause of the form:

  (nv = nv@pre)   (4)

for each new variable nv. There is no need to verify these specifications, since the methods have been verified to satisfy their original specifications, and the new clauses of the form (4) must necessarily be satisfied since clearly the original methods could not have accessed variables such as nv.
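A sketch of this kind of enrichment, extending the hypothetical Account/Bank sketch from the introduction (the names TCAccount and tCount follow the paper's case study, but the case-study code itself is not reproduced in this text, so the bodies below are only our illustration):

    // Application-level derived class: adds a transaction counter and redefines
    // the hooks so that they first do exactly what the framework-level bodies
    // did, and only then update the new variable.
    class TCAccount extends Account {
        private int tCount = 0;                 // new member variable introduced in the application

        @Override public void deposit(int amount) {
            super.deposit(amount);              // framework-level behaviour preserved
            tCount++;                           // application-specific enrichment
        }
        @Override public void withdraw(int amount) {
            super.withdraw(amount);
            tCount++;
        }
        @Override public String getInfo() {
            return super.getInfo() + ", trans count " + tCount;
        }
    }

Because each redefinition invokes the inherited body and only then touches tCount, it manipulates the framework-level member variables exactly as before — which is what makes the check described next immediate in the common case.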
Next, we must check that the redefined methods also satisfy their framework-level specification, this being the behavioral subtyping requirement we discussed earlier. In other words, if h() is one of the virtual methods of Ai that run() invokes, then in reasoning about run(), using the rule R1 to deal with calls to h(), we would have appealed to its framework-level specification. So when the developer redefines this method in a derived class CAi,j of Ai, unless the redefinition satisfies its framework-level specification, our reasoning would become unsound if the object on which h() is invoked is of dynamic type CAi,j. In practice, this check is usually simple, since the most common redefinitions of the hook methods are such that they do exactly what their framework-level bodies did as far as the member variables of Ai are concerned, and it is only these variables that can appear in the framework-level specification of h(); quite often, the redefined method is written as a call to the base class h() followed or preceded by additional code to update only the new variables defined in CAi,j. In this case, since the code that manipulates the framework-level variables is the same as before, it follows immediately that the redefined method satisfies the corresponding framework-level specification.

The final, and perhaps most important, step is to arrive at the enriched behavior of the run() method. Suppose the framework-level specification of run() is:

  {pf ∧ τ = ε} run() {qf}   (5)

where ε is the empty trace (at its start, run() has not made any hook method calls, so τ must be empty); pf specifies information about the initial states of the objects that the framework manipulates; and qf the final states of these objects as well as the final value⁹ of τ. We will assume that the framework designer has established (5).

⁹ We should note here that we are assuming that our run() method terminates. Many frameworks/applications are intended to run for ever. For dealing with such frameworks, we would have to use invariants; these invariants would be quite similar to our post-conditions and would be in terms of the states of the objects and the values of τ.

Suppose now we wish to arrive at the following application-level specification for run():

  {pa ∧ τ = ε} run() {qa}   (6)

where pa, like pf, specifies information about the initial states of the objects that the framework manipulates, but for each object, pa will include information also about the values of the new member variables introduced in the derived class (which is the dynamic type of the object in question) defined in the application, whereas pf would only contain information about the member variables of the corresponding base class defined in the framework. Similarly, qa would specify information about the complete final states of the various objects, including the values of the member variables introduced in the derived classes, whereas qf specifies only the 'framework portion' of the state of each object. Indeed, much of what we mean by enriched behavior is the information about the 'application portion' of the state of each object that qa provides and qf does not. The key question is, how do we arrive at the enriched specification (6) by plugging the application-specific behavior of the hook methods into the framework-level specification (5)? We will first introduce a number of notations that will be of use in answering this question. Below, we will use σ, σ′, etc.
to denote the 'state of the application', i.e., the states of all the objects manipulated by the application; the current 'state of an object' is essentially the current values of all the member variables of the object; we will use c, c′, etc. to refer to the individual objects.

  σ(c) : The state of the individual object c in the (application) state σ.
  σ(c)↑ : The 'framework portion' of the state of the object c.
  σ(c)↓ : The 'application portion' of the state of the object c.
  σ↑ : The framework portion of the state σ (i.e., for each object, we project out the framework portion of its state).
  σ↓ : The application portion of σ.
  |τ| : Length of τ, i.e., the number of elements in τ.
  τ[k] : The k-th element of τ.
  τ[k].ob : The first component of τ[k], i.e., the identity of the object involved in this hook method call.
  τ[k].hm : The second component of τ[k], i.e., the hook method called.
  τ[k].ar : The third component of τ[k], i.e., the argument values passed to the method.
  τ[k].re : The fifth component of τ[k], i.e., the results returned at the end of this call.
  τ[k].is : The fourth component of τ[k], i.e., the initial state of the object τ[k].ob at the time of the call.
  τ[k].fs : The sixth component of τ[k], i.e., the final state of the object τ[k].ob at the time of return.
  τ↑ : The sequence obtained from τ by retaining only the framework portions of each of the two states in each element of τ; similarly, τ↓ is obtained by retaining only the application portions of the states.

Consider now the question of going from (5) to (6). The relation that must hold between the respective pre-conditions pa and pf is easy to see: since, in the application, we wish to reuse the framework-level reasoning, we must have:

  pa ⇒ pf   (7)

The pre-conditions specify the effect of the initializations. (7) requires that the application-specific initialization ensure what has been assumed at the framework level. The relation between the post-conditions is more involved. Suppose we have states σ, σ′ that we expect are possible initial and corresponding final states of the application, and τ is the value that will be accumulated into the trace during the execution of run(), in going from σ to σ′. Then we note:
1. The framework portions of σ, σ′, τ must satisfy the framework-level specification (since otherwise they cannot arise according to that specification, and hence they cannot arise in the application either, since the redefinitions of the hook methods in the application are consistent, in the sense of behavioral subtyping, with their framework-level specifications).
2. The application portion of the state of each object does not change between hook method calls (since the framework-level code that executes between such calls cannot access this portion of the state).
3. Argument and result values, and initial and final states, recorded in any element of τ must satisfy the application-level post-condition of the hook method involved.
It is the third item above that plays the critical role in letting us arrive at the enriched behavior; it allows us to appeal to the enriched application-specific behavior of the hook method, and this is appropriate since it is to this method, depending on the run-time type of the object in question, that the call will be dispatched. Rule R2 below formalizes these observations. We first introduce two predicates:

  asu(σ, τ, σ′) : 'asu' denotes 'application portion of state is unchanged'.
This evaluates to true if, for each k (≤ |τ|): the application portion of the 'initial state' recorded in τ[k], i.e., τ[k].is↓, is the same as the application portion of the 'final state' recorded in the most recent preceding element of τ that involves the same object as τ[k], or, if there is no such preceding element, the same as the application portion of the state of this object in σ; and, for each object c, the application portion of this object in the final state, i.e., σ′(c)↓, is the same as the application portion of the final state recorded in the last element of τ that involves c (or is the same as σ(c)↓ if c is not involved in any element of τ). If either of these conditions is not satisfied, this predicate evaluates to false.

  sccs(τ) : 'sccs' stands for 'state change (in each element of τ) is consistent with the (application-level) specification (of the method)'.

This predicate evaluates to true if the arguments, results, initial, and final states recorded in τ[k] (for each k) satisfy the post-condition of the corresponding hook method in the appropriate derived class, based on the dynamic type of the object in question, i.e., τ[k].ob. Note that to evaluate sccs, we need to be able to determine the dynamic type of our objects. We will assume that such a facility is available. The formal definitions of these two functions are straightforward, if tedious, so we will omit them.

R2. Enrichment Rule

  {pf ∧ τ = ε} run() {qf}
  pa ⇒ pf
  [ pa ∧ qf[σ@pre ← σ@pre↑, σ ← σ↑, τ ← τ↑] ∧ asu(σ@pre, τ, σ) ∧ sccs(τ) ] ⇒ qa
  ------------------------------------------------------------------
  {pa ∧ τ = ε} run() {qa}

  [ post.Bank.run′() ∧ (τ[k].ob = a2) ∧ sccs(τ) ] ⇒
    [ [ (τ[k].hm = deposit) ⇒ (τ[k].fs.tFee = τ[k].is.tFee) ] ∧
      [ (τ[k].hm = withdraw) ⇒ (τ[k].fs.tFee = τ[k].is.tFee + 1) ] ]   (12)

Finally, because of the enriched behavior of getInfo() in both TCAccount and FeeAccount, we can argue that the corresponding strings output in outInfo will consist of the balance and transaction count information for the printInfo requests corresponding to a1, and the balance and transaction fee for the printInfo requests corresponding to a2. Note that this depends upon the information in (10.4), that the strings making up outInfo are based on the results returned by the calls to getInfo(), i.e., the .re components of the corresponding elements of τ.

  [ post.Bank.run′() ∧ (τ[k].hm = getInfo) ∧ sccs(τ) ] ⇒
    [ (τ[k].fs = τ[k].is) ∧
      [ (τ[k].ob = a1) ⇒ (τ[k].re = string(balance) ^ "trans count" ^ τ[k].fs.tCount) ] ∧
      [ (τ[k].ob = a2) ⇒ (τ[k].re = string(balance) ^ "trans fees" ^ τ[k].fs.tFee) ] ]   (13)

I.e., the result returned by a call to getInfo() is a string consisting of the balance in the account followed by either the string "trans count" and the tCount in the account, if the account in question is a1, or the string "trans fees" and the tFee in the account, if the account in question is a2. When the information in (13) and in (10.4) are combined, we can conclude that the results that appear in outInfo will indeed consist of the balance and transaction count information in the case of a1 and the balance and transaction fee information in the case of a2, and that this information will accurately reflect the states of these account objects, based on the preceding transactions, at the time that the information was added to outInfo.

6 Discussion

A framework implements behavior that is common to a range of applications. An application developer can build a new application with just the incremental effort of redefining the hook methods of the framework as needed by the particular application. Our goal has been to design a reasoning approach that would be similarly incremental, so that we need to reason just once about the framework.
When a new application is built on the framework, our approach allows the developer to arrive at the behavior of the complete application by plugging the behavior of the redefined virtual methods into the specification of the framework. We had to address two issues in our work. The first was ensuring that the redefined methods satisfy their framework-level specifications. The second was establishing the enriched behavior resulting from the redefinition of hook methods. Much of the past work on reasoning about OO programs has focused on the first issue, this being the essence of behavioral subtyping. For frameworks, the second is also critical, since enrichment of the behavior of the entire framework by just redefining the hook methods is the whole point of the framework-based approach. In our work, the inclusion, in the specification of the run() method, of information about the hook method calls it makes enables the application developer to arrive at the application-specific behavior by appealing to the richer behavior of the redefined hook methods. In order to keep our formalism relatively simple, we had to impose some restrictions on our model of frameworks. One of these restrictions was that only run() was allowed to call hook methods. As noted earlier, we could relax this restriction, but if we did so, we would have to associate a trace τf with each method f() that may invoke hook methods. This is because, given that f() invokes hook methods, redefinitions of these hook methods in the application will result in the behavior of f() being enriched in the same manner as that of run(); therefore, to arrive at the application-specific behavior of f() incrementally, i.e., without reanalyzing its code, we would need information about the sequence of hook method calls it makes, etc., in other words just the sort of information we record in the trace. Correspondingly, our enrichment rule would have to be generalized to apply to all template methods like f() rather than only run(). Another restriction in our model was that all the component objects of the framework/application were explicitly named, such as the accounts a1 and a2 in the case study. In practice, frameworks may have anonymous components, with only references to them being stored, perhaps in an array. As long as there are no additional complications, such as aliasing between these various components, our approach needs only straightforward extensions to deal with such anonymous components. Aliasing presents more serious challenges, as we note below. One important problem with our approach is the complexity of the specification of run(). Even in the case of our relatively simple case study, a full formal specification of run() would have been quite involved. The complexity arises mainly because of having to deal with inductive manipulations of the trace elements. For example, if we wanted to formalize the clause (10.2), we would have to refer to all preceding transactions involving a given account when specifying the balance in the account following any particular transaction. This complexity is mostly notational and can be mitigated by devising suitable notations, perhaps using regular expressions, or notations similar to those of [3]. This should also make the approach more accessible to system developers who may not have much formal training but are likely to be comfortable with such notations. A more interesting problem, from a conceptual point of view, is that of framework composition.
The question is how to extend the reasoning approach to deal with applications that are composed from several frameworks, rather than a single one. It should be relatively straightforward to extend the notion of our traces to deal with the various hook method calls made by a framework composed from two or more frameworks. The more challenging question that will have to be addressed here is that of object references, i.e., one object containing a reference to another object. The problems in reasoning about such systems (even in the absence of hook methods) have been discussed, for example, by Meyer [21]. The problem is essentially that of aliasing, but while aliasing has in the past been considered a sign of poor design, many modern OO systems, including those built from frameworks, use object references to great advantage. Hence the reasoning system needs to provide appropriate ways to deal with object references. One approach to object references [6, 25] is to take explicit account of object identity. While this approach works for a given, specific pattern of referencing between objects in a given system, it remains to be seen whether it can be extended to deal with flexible patterns of references such as are likely to be allowed in frameworks. Perhaps something like our trace of hook method calls, but one that allows us to record information about the details of these references, would prove useful.

Acknowledgments: The authors would like to thank the anonymous referees for detailed comments on the first draft of the paper.

References

[1] P. America. Designing an object oriented programming language with behavioral subtyping. In Foundations of Object-Oriented Languages, REX School/Workshop, LNCS 489, pages 69-90. Springer-Verlag, 1991.
[2] K. Beck and R. Johnson. Patterns generate architectures. In Proceedings of the Eighth ECOOP, pages 139-149, 1994.
[3] M. Büchi and W. Weck. The greybox approach: when blackbox specifications hide too much. Technical Report TUCS TR No. 297, Turku Centre for Computer Science, 1999. Available at http://www.tucs.abo.fi/.
[4] R.H. Campbell and N. Islam. A technique for documenting the framework of an OO system. Computing Systems, 6:363-389, 1993.
[5] Apple Computer. MacApp II Programmer's Guide. Apple Computer, 1989.
[6] F.S. de Boer. Speaking of objects. Technical report, Utrecht University, The Netherlands, 1995.
[7] D. D'Souza and A. Wills. Objects, Components, and Frameworks with UML. Addison Wesley, 1999.
[8] M.E. Fayad and D.C. Schmidt. Special issue on object oriented application frameworks. Comm. of the ACM, 40, October 1997.
[9] G. Froehlich, H. Hoover, L. Liu, and P. Sorenson. Hooking into OO application frameworks. In Proc. of 1997 Int. Conf. on Softw. Eng., pages 141-151. ACM, 1997.
[10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable OO Software. Addison-Wesley, 1995.
[11] D. Gangopadhyay and S. Mitra. Understanding frameworks by exploration of exemplars. In Proc. of 1995 Int. Wkshp. on Computer Aided Softw. Eng., pages 90-99, 1995.
[12] D. Garlan, S. Jha, D. Notkin, and J. Dingel. Reasoning about implicit invocation. In Proc. of Foundations of Softw. Eng. (FSE-6), pages 209-221. ACM, 1998.
[13] R. Helm, I. Holland, and D. Gangopadhyay. Contracts: Specifying behavioral compositions in object-oriented systems. In OOPSLA-ECOOP, pages 169-180, 1990.
[14] C. Horstmann. Mastering Object-Oriented Design in C++. Wiley, 1995.
[15] R. Johnson. Documenting frameworks using patterns. In Proc. of 1992 OOPSLA, pages 63-76. ACM, 1992.
[16] R. Johnson and B. Foote. Designing reusable classes. Journal of OOP, 1:26-49, 1988.
[17] C. Jones. Systematic Software Development Using VDM. Prentice-Hall, 1990.
[18] G. Krasner and S. Pope. A cookbook for using the MVC interface paradigm in Smalltalk-80. J. of Object Oriented Programming, 1(3):26-49, 1988.
[19] B. Liskov and J. Wing. A new definition of the subtype relation. In ECOOP, 1993.
[20] B. Liskov and J. Wing. A behavioral notion of subtyping. ACM Trans. on Prog. Lang. and Systems, 16:1811-1841, 1994.
[21] B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.
[22] J. Misra and K. Chandy. Proofs of networks of processes. IEEE Trans. on Software Eng., 7:417-426, 1981.
[23] W. Pree. Meta patterns: a means for capturing the essentials of reusable OO design. In Proc. of ECOOP, pages 150-162, 1994.
[24] H. Schmid. Creating architecture of a framework by design patterns. In Proc. of OOPSLA '95, pages 370-384, 1995.
[25] N. Soundarajan and S. Fridella. Reasoning about polymorphic behavior. In Ege, Singh, Meyer, Riehle, and Mitchell, editors, Proceedings of TOOLS 26, pages 346-358. IEEE Computer Society Press, 1998.
[26] N. Soundarajan and S. Fridella. Framework-based applications: From incremental development to incremental reasoning. In W. Frakes, editor, Proc. of 6th Int. Conf. on Softw. Reuse, LNCS 1844, pages 100-116. Springer, 2000.
[27] J. Warmer and A. Kleppe. The Object Constraint Language. Addison-Wesley, 1999.
[28] D. Wilson and S. Wilson. Writing frameworks: capturing your expertise about a problem domain. In Tutorial notes, Eighth OOPSLA. ACM Press, 1993.

Towards software design automation with patterns

Ahmet Sikici and N. Yasemin Topaloglu
Ege University, Computer Engineering Department, Bornova, Izmir - 35100 Turkey.
ahmets@egenet.com.tr, vasemin@bornova.ege.edu.tr

Keywords: design patterns, software reuse, components

Received: March 23, 2001

Design patterns are powerful design and reuse tools in software development. However, current usage of patterns seems to employ only a small portion of the enclosed potential. Commonly, pattern usage is limited to the manual customization of just a few legacy patterns, or documentation of existing designs. We think that a more effective use of patterns requires standardization and formalization of pattern utilization. In this work, we propose a formalism based on a fully generalized and purely symbolic interpretation of patterns, which focuses on the representation of reusable knowledge.

1 Introduction

The reuse of design and code has been a major issue of debate in the software community, with an increasing popularity in the last decade. Most of these reuse activities take place in the ad-hoc reuse (Prieto-Diaz 1993) category. However, in order to maximize the benefits of reuse, such as reduced time to market, better quality and decreased development costs, it is a well agreed fact that a systematic approach to reuse is essential. Systematic reuse has been named as a "paradigm shift" in software development involving reuse support in all the phases of the software development life cycle (Frakes 1994). One of the innovative approaches is pattern-based reuse. In the 1970s Christopher Alexander, an architect, introduced the pattern notion and proposed "a generative pattern language" which is composed of 250 patterns of architecture solutions (Alexander 1979).
Alexander proposed to use combinations of a limited number of design patterns in order to build an unlimited number of solutions in the architecture domain. Recently, the software community has adopted the pattern concept (Coplien 1997) as design templates and, since the book on Design Patterns by the GoF (Gamma et. al. 1994), the notion has gained wide popularity. According to the popular view, a design pattern does not contain any implementation and only introduces the key aspects of a design. In state of the art pattern definitions, natural language usage is dominant and automation of patterns is out of consideration. Although state of the art design patterns provide a common language for the exchange of design solutions and a documentation tool, we believe that pattern-based reuse can be more beneficial through the realization of a formal pattern model. In the literature, there are several projects on the formalization of patterns (Eden et. al. 1998, Smolarova et. al. 1998a), which aim to formalize the instantiation procedure of pattern usage. In our opinion, formalization of design patterns can be applied in a broader context, including the pattern production processes. The purpose of our work is to provide a purely pattern-based model as a standard medium for the representation of reusable design knowledge. In this paper we present a general symbolism which is suitable for expressing any kind of design knowledge, without posing any semantic restriction or implication upon its content. Our discussion is based on a pattern model and a pattern arithmetic, which resemble object-oriented concepts and involve defining several operations, like inheritance, for design patterns. Prior to the introduction of our approach, we will attempt to orient our discussion with respect to the concepts and issues of the object-oriented paradigm and component-based technologies. For this purpose these concepts will need to be revised over abstract models and illustrated by simple examples.

The paper is organized as follows. In Section 2, our pattern-based approach is compared with the object-oriented and component-based approaches conceptually. Section 3 covers the description of the pattern model and it is followed by the description of the pattern arithmetic in Section 4. The type checking phenomenon and the advanced message interpretation for the patterns are discussed in Section 5 and Section 6, respectively. The conclusion is given in Section 7.

2 Patterns as Components in Reusable Software Development

We claim that pattern-based components can have certain advantages over their object-oriented counterparts in reusable software development. Before the discussion of the possible contributions of a pattern orientation, the general concepts of component-based reuse will be reviewed.

2.1 Component-Based Reuse

In component-based reuse, components are bound in a way that they can cooperate for performing the required functionality. Component technology requires the presence of certain descriptive measures to define how components can combine, and some protocol definition facility to define how they can communicate. In the rest of the paper, these two tasks of components will be called association and communication, respectively. In the literature, there are several definitions given for a component. Parrish et al. (Parrish et. al. 1999) define a component as a software artifact consisting of a service interface, a client interface and an implementation.
The service interface consists of the services that the component exports to the rest of the world, the client interface consists of those services used by this component but exported from other components, and the implementation is the code necessary for the component to execute its intended functionality. The sum of the service and the client interfaces makes up the interface of the component. Prieto-Diaz (Prieto-Diaz 1993) categorizes reuse from several aspects; according to this classification, reusing components as they are is called black box reuse and reusing components with modification is called white box reuse. It is our opinion that it can be more practical to use an analog value, like the transparency of reuse, instead of the bivalent parameter of being black or white. In this context, an ideal black box reuse can be considered to be totally opaque and so-called white box reuse can be considered to be totally transparent. Components rely on a notion called wrapping, which hides the implementation of the component and provides an interface for the component. The separation of the implementation and the interface is a crucial aspect of component-based reuse since it provides the necessary structure for the intended opacity.

Figure-2. The abstract structure of a component.

In Figure-2, an abstract view of a component is described. The piece of code that performs the wrapping and separates the interface from the implementation is named the envelope. A component's envelope manages association with other components at coding time and communication at run time. These are the two major component roles, so it can be argued that the implementation section of a component is not referred to by component systems, resulting in poor support for transparent reuse. However, black box reuse is well supported by these models, theoretically over all domains, since any code that can be wrapped can be reused. When patterns are considered, strict encapsulation is not feasible, since transparency is encouraged and the distributed nature of patterns makes them almost impossible to be encapsulated sufficiently by an envelope. Both the association and the communication of pattern-based components may require the implementation part to be involved. This means that no part of a typical pattern-based component can be idealized as statically private. Instead, every piece of knowledge that a pattern encloses should be theoretically reusable.

2.2 Objects vs. Patterns as Components

Parrish et al. (Parrish et. al. 1999) argue that it is disturbing to put objects and components into different categories from the beginning and that the relationships are strong and obvious. The facilities that the object-oriented paradigm grants to the component technology include support for interfacing, encapsulation, easy deployment and privatization. The object-oriented type checking mechanism handles most of the component validation tasks and the communication between the components is carried out by the object-oriented messaging facility. The conceptual roots of the object-oriented paradigm and those of the design patterns are distinct. Object-oriented programming makes use of our daily acquaintance with physical objects. On the contrary, a pattern is the representation of an idea, which is something purely within the designer's imagination (Sikici & Topaloglu 2000). This fundamental difference between objects and patterns has many practical implications.
One immediate result is that new methods should be employed to perform type checking or validation for patterns. While the type of an object can be defined by means of the properties and methods it possesses, for patterns the solution is not that straightforward. The problem can be illustrated very well with a physical example: by the difference between defining what a car is and what a bridge is. A car can be easily visualized as an object, while a bridge is more like a pattern instance, so the type "car" can be implemented via a class but a pattern should be authored to provide a complete description of the notion "bridge". It is relatively easy to program a computer to recognize a car, as it can be defined analytically. A car is a vehicle with four wheels, an engine, a steering wheel, an exhaust pipe, two headlights etc. On the other hand, a bridge can be in many different forms. Our discussion is not limited to the fact that a pontoon bridge, an arch bridge and a suspension bridge have very little in common in terms of physical appearance. We also claim that if some engineer comes up with a new type of bridge that does not look like any of the known forms, most people can still tell at first sight that it is a bridge. Although patterns do not have clear-cut boundaries and handles, the human mind can easily cope with them. The kind of intelligence that lets us identify a bridge cannot be carried out with conventional type checking methods. More complex pattern recognition methodologies are needed.

Common practice prefers a different reuse medium at each level of granularity. It is argued that, unlike objects, which were too granular, components permitted reuse of implementation and related interfaces at medium granularity (D'Souza & Wills 1999). According to the current view about patterns, pattern-based reuse takes place at an even higher level than component-based reuse. Gamma et al. (Gamma et. al. 1994) state that design patterns describe commonly recurring structures of communicating components that solve general design problems within a particular context. However, there is no theoretical restriction on the employment of patterns at a finer-grained level, as objects and components can be shown to be special instances of patterns. In (Sikici & Topaloglu 1999) objects and components are viewed as basic star-shaped patterns. However, in order to reach a broad-band reuse medium in terms of granularity, the model needs to be kept free from semantic content.

3 The Pattern Model

We base our discussion of pattern-based components on a pattern model (Sikici & Topaloglu 1999) which is composed of nodes and links, each having a name and a set of attributes. Every pattern is represented with a standard notation, which can be described as a special purpose semantic net. The most significant attribute of nodes and links distinguishes each as either a hot or a cold element, indicating whether the structure could have further details at that point. In Figure-3, a simple notation to denote the four types of elements is shown.

Figure-3. Four types of elements of the pattern model.

Each hot element can be replaced by a whole pattern, allowing the nesting of patterns. A hot element can be visualized as a gap in the pattern specification waiting to be filled by another pattern. Hot elements have restriction criteria related to the type of the replacing pattern in order to ensure the safety of the replacement.
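As a rough illustration of this model - our own sketch, with all concrete names invented rather than taken from the paper's notation - a pattern can be held as a small semantic net of named, attributed nodes and links, each flagged hot or cold:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str                       # concrete name (cold) or required pattern type (hot)
    hot: bool = False               # hot = a gap that a whole pattern may later fill
    attrs: Dict[str, str] = field(default_factory=dict)

@dataclass
class Link:
    name: str                       # relation name (cold) or required pattern type (hot)
    source: str                     # name of the source node
    dest: str                       # name of the destination node
    hot: bool = False
    attrs: Dict[str, str] = field(default_factory=dict)

@dataclass
class Pattern:
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

# A tiny illustrative pattern: a hot "Storage" placeholder supplying data to a cold "Report" node.
p = Pattern()
p.nodes["Storage"] = Node("Storage", hot=True)
p.nodes["Report"] = Node("Report", attrs={"medium": "paper"})
p.links.append(Link("supplies data", "Storage", "Report"))
```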
The name of a cold node, just as the type of a hot node, can be considered to be the only nesting interface of a single node. Links have two additional properties within their nesting interface. Each link has two labels to match the nodes that it connects. These are named the source label and the destination label, as shown in Figure-4.

Figure-4. The association interface of a link.

The source and the destination labels define a restriction on the relation provided by the link, by imposing type constraints on the nodes that are related. These labels make up the association interface of a link. Our pattern model is meant to provide a base for higher systems that can associate each node and link type with the necessary semantic content, by enhancing the basic mechanical behavior.

4 Pattern Arithmetic

We consider several operations of the object-oriented technology, like inheritance and aggregation, as the elements of an object-oriented arithmetic. While automatic processing of patterns is limited to the refinement operation, with the help of the object-oriented arithmetic objects can be reused to build more complex objects or components. We claim that with a suitable set of operations, a formal pattern model can be even more orthogonal than object-oriented component systems. Below we discuss a set of operations that can be executed on patterns. Together with the pattern model and the type checking mechanism, these operations constitute a pattern arithmetic.

4.1 Nesting the Patterns

Nesting is the most fundamental operation in the pattern arithmetic. Nesting is said to take place between two components when a component becomes a local property of the other. Although a component is nested inside another one through a single link, nesting a pattern into another pattern usually requires a lot of new bonds to be made. For this reason patterns naturally possess the potential to perform a stronger validation check while attempting to nest one reuse entity into another. According to (Bosch 1997), since the designers of reusable components are unable to make all but minimal assumptions about the context in which the component will operate, the binding of the acquaintances required by the component is often performed in an ad-hoc manner. On the other hand, patterns require that the nested pattern makes all of its associations with its surrounding automatically and, due to the structural representation of knowledge by the links that have been made, at least theoretically no additional semantic contract is required.

Figure-5. System Monitor Pattern.

In Figure-5, a basic pattern example is demonstrated by using the proposed pattern notation. This pattern is called the System Monitor Pattern and it describes, in a rather abstract way, how a system is monitored by a user. In the figure, the elements "User Interface", "Data" and "System" are arranged as hot elements, indicating that the details related to these parts are not known. The overall structure, however, is known and it indicates that the System directly or indirectly sends Data to the User Interface while it receives commands from it. The User's interaction with the User Interface has been denoted with cold elements, indicating a static zone in the architecture.

Figure-6. Basic Terminal Interface Pattern.

In Figure-6, a pattern called Basic Terminal Interface is given.
This pattern defines a basic character-based user interface like that of a dumb terminal. If the type checking system verifies that the Basic Terminal Interface Pattern is an implementation of the User Interface Pattern, we can nest it into the position of the hot User Interface node in Figure-5. The problem of performing type checking on patterns will be discussed in the next section.

Figure-7. The resultant pattern after the nesting operation.

In Figure-7 the resultant pattern produced by the nesting operation can be observed. This new pattern describes a system monitoring pattern where the user interface accepts basic character-based input through the keyboard and displays character-based data from the system.

4.2 Abstraction of a Pattern

Sometimes it is beneficial to decrease a pattern's determinism. Büchi and Sekerinski (Büchi & Sekerinski 1997) state that non-determinism is a fundamental tool for specifications to avoid laying down unnecessary details. It is also useful in providing flexibility and purifying the reusable knowledge. To achieve this we employ an operation called abstraction, in accordance with the conceptual framework defined by Smolarova et al. (Smolarova et. al. 1998a, 1998b). The description of the abstraction operation is rather straightforward since physically the reverse of the nesting operation is performed. While in the nesting operation a hot node is replaced by a pattern, in the abstraction operation a zone is replaced by a hot node. The operation is performed in two steps. In the first step a zone is marked over a pattern and in the second step that zone collapses into a single hot node.

Figure-8. The abstraction operation.

In Figure-8 the operation has been illustrated on a plain example. A zone P is chosen over a pattern and all the details that are enclosed by that zone are cleared by the operation. The semantic content of the zone is now represented by a single identifier P, which is a pattern type as shown in Figure-3. Under normal conditions the abstraction operation causes a loss of knowledge, resulting in a pattern that is not only more abstract but also more general.

4.3 Inheriting from a Pattern

In order to reach an understanding of inheritance for patterns, we first need a formal definition of the concept of inheritance. The conceptual definition of inheritance is very similar to the definition of reuse itself. According to (Weck 1997), in principle, inheritance is equivalent to copying and modifying source code. Obviously this definition aims at inheriting from code, but it can be generalized and strengthened to fit our definition of pattern inheritance. In our view, the most suitable definition of inheritance is that it is the act of basing the construction of a new entity on an existing parent entity, while preserving all the key aspects related to the parent entity. From the pattern-oriented point of view, conceptually, the key aspect to be preserved is the knowledge content of the pattern. If we declare that a pattern inherits from another, this means that all the knowledge contained in the parent pattern is also possessed by the child pattern. The physical interpretation of this definition is that, when inheriting from a pattern, all the existing nodes and links should be preserved. In Figure-9, a new pattern, which was created by extending the Basic Terminal Interface Pattern, is depicted. In this pattern, keyboard input has been preserved and mouse input has been added to the system.

Figure-9. Extended Basic Terminal Interface Pattern.
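The operations introduced so far can be summarized in a compact executable sketch (ours; a pattern is reduced here to a dict of node names and a set of (label, source, destination) link triples, and the single entry node passed to nest() is a simplification of the model's label-based re-attachment of links):

```python
# A pattern here is a dict: {"nodes": {name: "hot" | "cold"}, "links": {(label, src, dst), ...}}.

def nest(host, hot_node, sub, entry):
    """Replace the hot node `hot_node` of `host` by the whole pattern `sub`.
    Links that touched the hot node are re-attached to `entry`, a node of `sub`."""
    result = {"nodes": {**{n: k for n, k in host["nodes"].items() if n != hot_node},
                        **sub["nodes"]},
              "links": set(sub["links"])}
    for (label, src, dst) in host["links"]:
        src = entry if src == hot_node else src
        dst = entry if dst == hot_node else dst
        result["links"].add((label, src, dst))
    return result

def abstract(pattern, zone, new_hot):
    """Collapse all nodes in `zone` into the single hot node `new_hot` (inverse of nesting)."""
    nodes = {n: k for n, k in pattern["nodes"].items() if n not in zone}
    nodes[new_hot] = "hot"
    links = set()
    for (label, src, dst) in pattern["links"]:
        s = new_hot if src in zone else src
        d = new_hot if dst in zone else dst
        if not (s == new_hot and d == new_hot):    # details internal to the zone are cleared
            links.add((label, s, d))
    return {"nodes": nodes, "links": links}

def extend(pattern, extra_nodes, extra_links):
    """Pattern inheritance: all parent nodes and links are preserved, new ones are added."""
    return {"nodes": {**pattern["nodes"], **extra_nodes},
            "links": set(pattern["links"]) | set(extra_links)}

terminal = {"nodes": {"Char Input": "cold", "Char Output": "cold"},
            "links": {("shows", "Char Input", "Char Output")}}
monitor  = {"nodes": {"User Interface": "hot", "System": "hot"},
            "links": {("commands", "User Interface", "System")}}

nested   = nest(monitor, "User Interface", terminal, entry="Char Input")
extended = extend(terminal, {"Mouse Input": "cold"},
                  {("provides", "Mouse Input", "Char Output")})
```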
We need to mention that inheritance does not always produce a more specific pattern in the semantic sense. In fact, inheritance may change the generality and abstractness of a pattern in either direction. The semantic effect of inheritance depends on the semantic context, which is imposed by higher systems, and on the physical key aspects that are preserved in the inheritance operation.

4.4 Superposition of Patterns

Superposition is probably best described in Christopher Alexander's words: "It is also possible to put patterns together in such a way that many patterns overlap in the same physical space: the building is very dense; it has many meanings captured in a small space; and through this density, it becomes profound" (Alexander et. al. 1977). We define superposition as a union of all the nodes and links of the two patterns. In this operation, two patterns are joined in a single scope, sharing their common entities and relations. Recurring elements are not repeated, but are re-referenced in the resultant pattern.

Figure-10. The superposition of patterns.

Figure-10 illustrates the superposition operation in an abstract way. Superposition is suitable when two patterns are physically different but are known to be describing the same phenomenon. In that case, two loose descriptions of the same design can be superposed into one solid description. Superposition can also be viewed as multiple inheritance for patterns and it can be used to combine two ideas into one. In contrast with object-oriented multiple inheritance, the superposition operation does not aim to produce specific instances of the operands. In parallel with pattern inheritance, preserving the enclosed knowledge is the main focus of superposition.

5 Type Checking

5.1 Component-Based Validation Check

In the idealized definitions of a component, a validation check is promised. For example Kozaczynski (Kozaczynski 1994) states that a component conforms to and provides a set of interfaces in the context of a well-formed system architecture. However, the situation of the existing component systems in terms of integrity validation seems to be quite problematic. (Schreyjak 1997) states that the replacement and addition of components are not supported in an adequate way by today's component systems, due to the lack of integrity checks that can verify that all the required interfaces and preconditions are met. Two innovative attempts to solve the validation problem of component usage will be introduced briefly. One is called a "port and link framework" (Wang & MacLean 1999) and the other is a "reuse contract" (DeHondt et. al. 1997). A "port and link" framework is proposed to support component assembly for distributed systems in (Wang & MacLean 1999). In this model, a functional boundary is described for the component by four lists, namely services a component provides, services a component requires, events a component observes and events a component generates. In Figure-11, the knowledge content of such a definition has been illustrated. This model clearly extends the classical approach with event generating and observing. The local services and generated events are called ports and they correspond to the service interface. Observed events and called services are called links and they correspond to the client interface.
In such a system, validation of the association of components is implemented by matching the links of one component with the ports of the other, and vice versa.

Figure-11. Content of an architectural component definition.

However, (DeHondt et. al. 1997) demonstrates the insufficiency of a port and link approach and supports a "reuse contract" for the adequate definition of a component interface. A reuse contract keeps track of the services that call each other within and across the component boundaries, providing an overview of the component's interactions. Figure-12 depicts a representation of a reuse contract.

Figure-12. A reuse contract.

Since modifying or extending the system would probably break some of the existing links and construct new ones, some "reuse operators" were introduced to keep track of the modifications. Each "derived contract" is represented by a series of reuse operators and the association of the components across the derived contracts is validated by a consistency check between those series. A reuse contract can be considered to be an instance of an architectural pattern and seems to be useful in its domain. However, the need to introduce the reuse operators suggests that the issue of extending the reuse contracts is problematic. It is an obstacle that a common base contract is a minimum requirement for checking the validity of the association of two components. In our opinion such a two-phased method, with its descriptive and procedural parts, is not as promising as a purely descriptive type checking method.

5.2 Pattern-Based Validation Check

According to our approach, theoretically a separate structural type checking method can be defined for any semantic aim that can be attributed to type checking. If association is considered, as in the case of component models, naturally type checking should focus on the external links of the pattern. The same is true for the nesting operation. A pattern's type according to association and nesting is defined by listing the incoming and outgoing links of the pattern.

Figure-13. Natural association and nesting interface (Interface: +a, +b, -c, -d).

As indicated in Figure-13, the pattern can be considered as if it has been totally abstracted into a hot node, and its interface can be defined just as in the port and link method for components. Regardless of the implementation of the Pattern X, if its elements have the required external relations, it belongs to the equivalence class defined by the interface. This interface is called the natural association and nesting interface because it can be overridden at a higher layer in order to load additional semantic meaning onto the pattern's type. This can be made possible by defining a pattern interface (Sikici & Topaloglu 1999) explicitly. In that case, instead of matching the external links of the pattern with the links of the context, an explicitly specified portion of the pattern is matched against the context.

Figure-14. Explicit association and nesting interface.

In Figure-14, a basic example of an explicit association and nesting interface has been defined. The pattern that is shown in Figure-14 inherits from that of Figure-13. This time, however, an additional constraint should be met for a nesting or association to be possible. The source of the link "b" is linked to the target of link "c" with a relation "p". Only if such a relation is present within the context is a nesting of the pattern X into that context allowed.
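The natural interface check can be phrased structurally, as in the following sketch (ours, reusing the dict representation from the earlier sketches; the subset test is a deliberate simplification of the equivalence-class idea):

```python
def interface_of(pattern):
    """Natural association/nesting interface: names of links entering and leaving the pattern,
    i.e. links with exactly one endpoint outside the pattern's own nodes."""
    inside = set(pattern["nodes"])
    incoming = {l for (l, s, d) in pattern["links"] if d in inside and s not in inside}
    outgoing = {l for (l, s, d) in pattern["links"] if s in inside and d not in inside}
    return incoming, outgoing

def can_nest(context, hot_node, candidate):
    """Nesting is allowed when the candidate offers at least every link that the context
    already attaches to the hot node (a purely structural check)."""
    needed_in  = {l for (l, s, d) in context["links"] if d == hot_node}
    needed_out = {l for (l, s, d) in context["links"] if s == hot_node}
    offered_in, offered_out = interface_of(candidate)
    return needed_in <= offered_in and needed_out <= offered_out

terminal = {"nodes": {"Char Input": "cold", "Char Output": "cold"},
            "links": {("shows", "Char Input", "Char Output"),
                      ("commands", "Char Input", "System"),   # external: System is not a node here
                      ("data", "System", "Char Output")}}
monitor  = {"nodes": {"User Interface": "hot", "System": "hot"},
            "links": {("commands", "User Interface", "System"),
                      ("data", "System", "User Interface")}}

print(can_nest(monitor, "User Interface", terminal))   # True
```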
6 Advanced Message Interpretation

As has been stated in Section 2.1, communication is one of the two main issues that component systems have to deal with. Most of the state-of-the-art component systems rely heavily on object-oriented message sending mechanisms for communication. Little has been added by the component technologies on this issue. Communication from an object-oriented point of view is called "message sending", as objects are known to communicate by sending messages to each other. An object-oriented message is an invocation of an operation. It consists of the name of the operation, the identity of the recipient, and a set of arguments (D'Souza & Wills 1999). An object's method or operation is simply a procedure that has been defined with respect to that object's scope, and a message is a procedure call. Below we give a representation of a message sending syntax in pseudo-code.

myRecipient.myOperation(arg1, arg2, ..., argN)

Object-oriented components interpret messages in a rather straightforward way. The class definition contains code for the operations the object "understands", that is, the operations for which there are specifications, and hence clients could expect to send it (D'Souza & Wills 1999). The notions of object-oriented programming like polymorphism, overriding and overloading play the chief roles in the interpretation of the message. The tasks that are attributed to these phenomena can be summarized as the smart matching process of the incoming messages with the defined operations, by using the recipient's name, the operation's name and the list of arguments. Sometimes, due to a wrapper object or a layer, the operation name or the order of arguments may be changed before the message is transferred to the private part of the component. An example for this case is a system called LayOm (Bosch 1997), which employs layers around classical objects in order to interpret messages. Further interpretation mechanisms are used by the components of distributed architectures, where the main initiatives are providing compatibility and preserving security. The pattern-based component model brings about the idea that a more intelligent message interpretation can be possible by using the knowledge enclosed by a pattern. One method takes advantage of a parallelism between a syntax diagram and a pattern. The method that will be introduced here is based on the fact that, through a series of knowledge-preserving transformations, pattern graphs can be transformed into syntax diagrams (Sikici & Topaloglu 1999).

Figure-15. Transforming a pattern into a syntax diagram.

Figure-15 depicts such a transformation. The possibility of such a transformation shows that patterns can parse complicated messages that have a structural form, just like the sentences of a language. Each pattern can be seen as defining a little language through which it interprets the incoming messages.

Figure-16. A pattern-based message.

In Figure-16 such a complicated message is represented. A pattern-based message is a train of tokens (Sikici & Topaloglu 2000). While it is being parsed, each token can be received by a separate node or link of the pattern. The receipt of each token by an element of the pattern is the equivalent of an object-oriented message transfer. The pattern's structure and the message's content mutually determine a scenario that defines in what order the messages will be received by the nodes.
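A minimal sketch of this parsing idea (ours; real patterns would yield branching syntax diagrams, whereas this handles only a linear chain of elements, and all token kinds and names are hypothetical):

```python
# A chain-shaped pattern interpreted as a little grammar: each element (node or link)
# of the chain receives one token of the message train.

def parse_message(chain, tokens):
    """`chain` is an ordered list of (element_name, expected_kind) pairs derived from the
    pattern; `tokens` is the train of (kind, value) tokens making up the message.
    Returns the per-element deliveries, or raises if the train does not fit the pattern."""
    if len(tokens) != len(chain):
        raise ValueError("message does not match the pattern's syntax diagram")
    deliveries = []
    for (element, expected), (kind, value) in zip(chain, tokens):
        if kind != expected:
            raise ValueError(f"{element} expected a {expected} token, got {kind}")
        deliveries.append((element, value))   # this element 'receives' the token
    return deliveries

# Hypothetical chain for an 'Inherits' relation: parent node, relation link, child node.
inherits_chain = [("Parent", "name"), ("Inherits", "keyword"), ("Child", "name")]
message = [("name", "Basic Terminal Interface"),
           ("keyword", "inherits"),
           ("name", "Extended Terminal Interface")]
print(parse_message(inherits_chain, message))
```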
7 Conclusions

In this paper we have discussed the benefits and plausibility of using patterns as reusable components. We outlined an approach for defining pattern-based components relying on a formal pattern model. The model does not deal with the semantic content of the patterns but it supports pattern-originated concepts from a higher level, with constructs to define names, types, attributes, relations and non-determinism. Our model has been enhanced with a set of basic constructive operations, forming a pattern arithmetic. The pattern arithmetic enables reuse of existing patterns in order to build new patterns. In our discussion we made use of the object-oriented and component-based phenomena and tried to inquire into their equivalents in the pattern paradigm. Patterns are problematic in certain areas like black box reuse; however, concepts like inheritance can be transferred to the pattern terminology with minor adaptations. In certain aspects, like validation, patterns perform better than objects and components, due to their descriptiveness and transparency. Although at this point in our work it is hard to make statements about the message interpretation issue, it is evident that patterns are promising in the sense that semantically complex messages can be interpreted by the patterns.

8 References

[1] Alexander C. (1979) The Timeless Way of Building. Oxford University Press, New York.
[2] Alexander C., Ishikawa S., Silverstein M., Jacobson M., Fiksdahl-King I. & Angel S. (1977) A Pattern Language. Oxford University Press, New York.
[3] Bosch J. (1997) Adapting Object-Oriented Components. Proceedings of the Second International Workshop on Component-Oriented Programming (WCOP'97), Jyvaskyla, Finland, June 1997.
[4] Büchi M. & Sekerinski E. (1997) Formal Methods for Component Software: The Refinement Calculus Perspective. Second International Workshop on Component-Oriented Programming (WCOP'97), Jyvaskyla, Finland, June 1997.
[5] Coplien J. O. (1997) On the Nature of The Nature of Order. Lecture in the Chicago Patterns Group, 5th August 1997 [notes taken by Brad Appleton]. http://www.enteract.com/~bradapp/docs/NoNoO.html.
[6] DeHondt K., Lucas C. & Steyaert P. (1997) Reuse Contracts as Component Interface Descriptions. Second International Workshop on Component-Oriented Programming (WCOP'97), Jyvaskyla, Finland, June 1997.
[7] D'Souza D. & Wills A. C. (1999) Objects, Components and Frameworks with UML. Addison Wesley.
[8] Eden A., Gill J.Y., Hirshfeld Y. & Yehudai A. (1999) Towards a Mathematical Foundation for Design Patterns. Technical report 1999-004, Department of Information Technology, Uppsala University.
[9] Frakes W. (1994) Systematic Software Reuse: a paradigm shift. Proceedings of the Third International Conference on Software Reuse, IEEE Computer Society Press, p 2-4.
[10] Gamma E., Helm R., Johnson R. & Vlissides J. (1994) Design Patterns, Elements of Reusable Object Oriented Software. Addison Wesley.
[11] Kozaczynski W. (1999) Composite Nature of Component. 2nd International Workshop on Component-Based Software Engineering, The 21st International Conference on Software Engineering (ICSE 99), Los Angeles, CA, USA, May 1999.
[12] Parrish A., Dixon B. & Hale D. (1999) Component-Based Software Engineering: A Broad Based Model is Needed. 2nd International Workshop on Component-Based Software Engineering, The 21st International Conference on Software Engineering (ICSE 99), Los Angeles, CA, USA, May 1999.
[13] Prieto-Diaz R. (1993) Status Report: Software Reusability. IEEE Software, May 1993, p 61-66.
[14] Schreyjak S. (1997) Coupling of Workflow and Component-Oriented Systems. Second International Workshop on Component-Oriented Programming (WCOP'97), Jyvaskyla, Finland, June 1997.
[15] Sikici A. & Topaloglu N.Y. (2000) Towards Building with Design Patterns. The Fifteenth International Symposium on Computer and Information Sciences (ISCIS), Istanbul, Turkey, October 2000, p 522-529.
[16] Sikici A. & Topaloglu N.Y. (1999) Design Patterns as Building Blocks in Reusable Software Design. The Fourteenth International Symposium on Computer and Information Sciences (ISCIS XIV), Kusadasi, Turkey, 1999, p 247-256.
[17] Smolàrovà M., Navrat P. & Bielikovà M. (1998a) Abstracting and Generalising with Design Patterns. The 13th Int. Symposium on Computer and Information Sciences (ISCIS'98), U. Gudukbay et al. (Eds.), IOS Press, 1998, p 551-558.
[18] Smolàrovà M., Navrat P. & Bielikovà M. (1998b) A Technique for Modelling Design Patterns. Knowledge-Based Software Engineering JCKBSE'98, Smolenice, Sept. 1998. P. Navrat and H. Ueno, eds, IOS Press, Amsterdam, p 89-97.
[19] Wang G. & MacLean H.A. (1999) Software Components in Contexts and Service Negotiations. 2nd International Workshop on Component-Based Software Engineering, The 21st International Conference on Software Engineering (ICSE 99), Los Angeles, CA, USA, May 1999.
[20] Weck W. (1997) Inheritance Using Contracts & Object Composition. Second International Workshop on Component-Oriented Programming (WCOP'97), Jyvaskyla, Finland, June 1997.

Support of knowledge management in distributed environment

Jan Paralič*, Marek Paralič**, Marian Mach*
*Dept. of Cybernetics and Artificial Intelligence, **Dept. of Computer Science, University of Technology in Košice, Letná 9, 042 00 Košice, Slovakia
Phone: +421 95 602 4128, Fax: +421 95 62 535 74
[paralic,paralicm,machm]@tuke.sk

Keywords: knowledge management, knowledge modelling, mobile agents, agent collaboration, e-Democracy

Received: March 11, 2001

An original approach to the support of knowledge management within an organization is presented in this article. This approach has been designed, implemented in the form of the KnowWeb system and tested on various pilot applications. Special attention is paid to organizations with a distributed environment. For this purpose an experimental system for the support of mobile agents, which combines the power of high-level distributed programming with the mobile agent paradigm, has been proposed and is presented here as well. Finally, experiences from two of the KnowWeb pilot applications as well as further application possibilities in the area of e-Democracy are sketched.

1 Introduction

Knowledge can be simply defined as actionable information (Tiwana 2000). That means (only) relevant information being available in the right place, at the right time, in the right context, and in the right way. The knowledge life cycle defined in (Nonaka & Takeuchi 1995) hinges on the distinction between tacit and explicit knowledge. Explicit knowledge is formal knowledge and can be found in the documents of an organization: reports, articles, manuals, patents, pictures, images, video, sound, software etc. Tacit knowledge is personal knowledge given by individual experience. A considerable amount of explicit knowledge is scattered throughout various documents within organizations. Quite often this knowledge is stored somewhere without ever being retrieved and reused again.
As a result, most knowledge is not shared and is forgotten in a relatively short time after it has been invented or discovered. Therefore, it has become very important for advanced organizations to make the best use of information gathered from various document sources inside companies and from external sources like the Internet. On the other hand, the tacit knowledge of the documents' authors provides important context to them, which cannot be effectively intercepted. Knowledge management (KM) generally deals with several activities that appear in the knowledge life cycle (Abecker et al. 1998): identification, acquisition, development, dissemination (sharing), use and preservation of an organization's knowledge. Of these activities, dissemination (sharing) is crucial. Knowledge that does not flow does not grow and eventually ages and becomes obsolete and useless. By contrast, knowledge that flows, by being shared, acquired, and exchanged, generates new knowledge (Borghoff & Pareschi 1998). Our approach to knowledge management supports most of the activities mentioned above. Based on this approach, the KnowWeb¹ toolkit has been designed, implemented and tested on 5 pilot applications. Firstly, it provides tools for capturing and updating tacit knowledge connected with particular explicit knowledge inside documents. This is possible due to an ontology, which is used for the representation of the organization's domain knowledge. Section 2 describes in more detail domain knowledge modelling in general and our approach to it in particular. Secondly, intelligent retrieval is enabled, making use of both kinds of knowledge linked together within the organisational memory. How the organisational memory in our approach is organised and what functionality it offers is presented in section 3. Next, efficient communication and support for distributed groups to share knowledge and exchange information efficiently is provided (section 4). For these purposes an experimental framework for mobile agents has been designed and implemented and is introduced in section 5. Experiences from two of the KnowWeb pilot applications as well as further application possibilities in the e-Democracy context are sketched in section 6.

¹ EC funded project ESPRIT 29065 "Web in Support of Knowledge Management in Company (KnowWeb)"

2 Domain knowledge modelling

2.1 General

Theoretical foundations for the research of domain modelling can be found in the works (Chandrasekaran et al. 1999; Gruber 1993; Wielinga et al. 1997), and others on ontologies and knowledge modelling. Ontology is a term borrowed from philosophy, where it stands for a systematic theory of entities that exist. In the context of knowledge modelling, Gruber introduced the term ontology as a set of definitions of content-specific knowledge representation primitives consisting of domain-dependent classes, relations, functions, and object constants. The ontology represents formal terms with which knowledge can be represented and individual problems within a particular domain can be described. Ontology in this sense determines what can 'exist' in a knowledge base. Chandrasekaran understands ontology as a representation vocabulary typically specialised to some domain. He suggests basically two purposes for which ontologies may be used:
• to define the most commonly used terms in a specific domain, thus building a skeleton,
• to enable knowledge sharing and re-use, both spatially and temporally - see also (Motta & Zdrahal 1998).
Ontology allows a group of people to agree on the meaning of a few basic terms, from which many different individual instantiations, assertions and queries may be constructed. Once there is a consensus on understanding what particular 'words' mean, knowledge represented by these words can be adapted for particular purposes.
Knowledge must be defined unambiguously, because different people in the organisational structure of an organization need to use it with the same meaning. Thus, it is possible to re-use and share the knowledge thanks to a shared understanding of its representation. Common understanding of the meaning of notions used in a given domain (the understanding may be domain-specific) results in the definition of concepts. Concepts are more or less abstract constructs on which a relevant part of the world is built, or better, which can be used to describe this relevant part of the world. Since concepts can differ in their character, several types of concepts, namely classes, relations, functions or procedures, objects, variables or constants, can be distinguished. These primitive constructs can be represented differently in different applications but they have the same meaning in all applications - i.e. when someone wants to communicate with somebody else, he/she can do it using constructs from the ontology of shared concepts. Ontology with syntax and semantic rules provides the 'language' by which KnowWeb(-like) systems can interact at the knowledge level (Newell 1982).

Figure 1. Sample of a domain knowledge model (ontology).

The concepts usually create a very complicated hierarchical, network (or tree)-like structure. However, even a complex structure covers only a specific part of the world, e.g. the narrow world of an organization and its activities. This structure models the world from a certain point of view. And here emerges the notion from the title of this section - Domain Knowledge Modelling - as the concepts are usually highly domain-dependent or subject-dependent, and can be meaningfully used only in the frame of the particular domain. In other words, what is acceptable and important, for example, for a property management company may not be suitable for a company dealing with distance delivery of educational courses.

2.2 Our approach

Based on the needs analysis of several pilot applications, two types of concepts have been identified as necessary and also satisfactory. They can be either generic (type class) or specific (type instance). Both of them have attributes. Concepts and relations among them are used to construct the domain model. Formally, a relation in KnowWeb is an oriented link between two concepts.
Two basic types of relations can be distinguished: subclass_of for relations between classes and instance_of for relations between classes and their instances. These two relation types enable inheritance of attributes and their values. Inheritance is an important mechanism for the development of a hierarchical ontology. Multiple inheritance is also supported, i.e. a class concept can inherit its attributes from several parent class concepts. Figure 1 represents an example - a very small part of a domain knowledge model. What shall be included in the domain model? The simple but vague answer is - everything that is relevant and important to describe a particular domain. In the case of a company, such a model may conceptually describe the company-specific concepts, such as its activities, projects, customers, employees etc., as well as the relations among these concepts. Each organization has some knowledge already gathered in the form of various databases and/or documents containing information about various technologies, products, customers, suppliers, projects or employees. Each company usually has some internal procedures for how to perform specific tasks. Simply said - knowledge exists in an established environment. This knowledge is traditionally called the organization's goals and know-how. From the knowledge modelling perspective, a repository of know-how, goals etc. may be addressed as an organisational memory or a corporate memory.

3 Organisational memory (OM)

3.1 Conceptualisation and retrieval

The KnowWeb system enables the author of any document to store his/her background knowledge together with the document by attaching the relevant concepts from the ontology - i.e. the document is stored with its context. Context can be attached to the document as a whole or to specific part(s) of a document called text fragment(s). A text fragment is a continual part of text within the document (e.g. a sentence or paragraph). In the present version, the KnowWeb toolkit is able to process MS Word documents, where no restrictions are given on text fragments, and HTML documents, where a text fragment cannot cross any HTML tag. In order to cope with these documents the system provides a set of tools. They differ in their functionality but together they enable documents' authors and users to manage knowledge in a company in an easy and user-friendly way. First, in order to place a document in the organisational memory, it is necessary to attach context knowledge (i.e. a piece of tacit knowledge) to it. This context can be in the form of a conceptual description (CD). By conceptual description is meant a set of links between a document (or its marked text fragments) and concepts in the domain knowledge model. Conceptual descriptions make it possible to refer not only to the explicit knowledge contained in the document but also to make use of tacit knowledge. In such a way easy sharing of knowledge in the future is enabled. The CD links can be created manually or semi-automatically. The user can select a text fragment and link this fragment to the domain knowledge model. The linking can be done directly to ontology concepts or to some template (for semi-automatic linking). Association links are of many-to-many type. It means that it is legal to link a document (or a text fragment) to several concepts and, obviously, a concept can be linked to several documents (and/or text fragments).
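The following sketch (ours) shows the two relation types, attribute inheritance along them, and many-to-many conceptual-description links; the concept names echo Figure 1, while the attribute values and document names are invented for illustration:

```python
subclass_of = {"Campaign Type": "Campaign", "Company Email": "Email"}     # class -> parent class
instance_of = {"Mecom-Trade": "Company", "Electronic Commerce": "Field"}  # instance -> class
attributes  = {"Company":     {"country": "unknown"},
               "Mecom-Trade": {"country": "Slovakia"}}                    # illustrative values only

def inherited_attributes(concept):
    """An instance inherits the attributes of its class and of all ancestor classes;
    more specific values override inherited ones."""
    chain, c = [], concept
    while c is not None:
        chain.append(c)
        c = instance_of.get(c) or subclass_of.get(c)
    attrs = {}
    for c in reversed(chain):
        attrs.update(attributes.get(c, {}))
    return attrs

# Conceptual description: many-to-many links between documents (or fragments) and concepts.
cd_links = {
    "report-2000.doc":        {"Mecom-Trade", "Campaign"},
    "minutes-03-2000.doc#p2": {"Electronic Commerce", "Mecom-Trade"},
}

def documents_for(concept):
    """All documents (or text fragments) whose conceptual description mentions the concept."""
    return [doc for doc, concepts in cd_links.items() if concept in concepts]

print(inherited_attributes("Mecom-Trade"))   # {'country': 'Slovakia'}
print(documents_for("Campaign"))             # ['report-2000.doc']
```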
When a document with its description is available (after manual or semi-automatic linking within the KnowWeb system, or after receiving the document with its description from outside and consecutive automatic linking), it can be incorporated into the organisational memory represented by a KnowWeb server. The description of this document is stored in the KnowWeb server as well. Another possibility is to store a conceptual description of a document without storing the document itself (the document can be located on another KnowWeb server in a distributed company or somewhere else, e.g. on the Internet). If an author is not satisfied with the conceptual description of a document stored in the organisational memory, he/she has the possibility to modify it and subsequently upload the document with the modified description into the organisational memory. The aim of storing documents in the organisational memory is to access the right knowledge at the right time or in the right situation. In order to express requirements on the documents which should be retrieved from the organisational memory, the user has to formulate a query. To formulate a query he/she can use concepts (or their attributes) from the domain model. They can be composed into a more complex structure using various operators (e.g. logical connectives). In general, the concepts specified in the query will be used to search the conceptual descriptions of documents.

3.2 Implementation

The Conceptualisation tool (CT tool) is built on top of three modules, namely DocView, OntoView, and TemplateEditor (see Figure 2). The CT tool serves as an "envelope" or integrator for these modules. The CT tool is able to deal with two types of documents. First of all, it is possible to open a new document, which will be linked to the corresponding concepts in the domain knowledge model, and a copy of this document will be stored on the local KnowWeb server. Two kinds of association links are available for this kind of documents:
• Association links between the whole document and (a) specific concept(s) from the domain knowledge model
• Association links between a text fragment of the document and one or more concepts from the domain knowledge model
The same may be done for documents that are already stored on the local KnowWeb server and are already (at least partly) linked to some concepts from the domain model. Existing association links can be edited/removed and/or new links can be added. Another sort of documents are those which will not be stored locally (e.g. documents accessed by the remote retrieval function) and in principle can be located on any web server on the Internet. We refer to the documents of this kind as referenced documents, because only their URLs are provided and serve as references to such documents. For this kind of documents, it is possible to define only the association links between a document as a whole and (a) concept(s) from the domain knowledge model. The same applies for other than MS Word and HTML types of documents, where text fragments cannot be defined.

Figure 2. KnowWeb client interface to organizational memory.
In order to create association link(s), the user uses the drag-and-drop functionality of the CT tool. It means that it is possible to "grab" a concept from the domain model, drag it, and drop it on a highlighted text fragment or the header of a document currently opened in the DocView window. The purpose of the DocView module is to provide users with the possibility to preview documents, with the option of highlighting those text fragments which are linked to the domain knowledge model. It also supports definitions of and modifications to the annotation structure of the document, which result in modified definitions of text fragments. Assignment of specific attributes to these text fragments is also possible. The analysis of pilot applications by our industrial partners (e.g. an application in the retail sector) identified a special requirement for "automatic" conceptualisation of documents with a rigid structure (e.g. the structure of daily reports in a retail chain, generated each day in each shop, is fixed). This requirement is fulfilled by the so-called "quiet mode" of the CT tool. In this case no visual component is started and the documents are linked automatically to predefined concepts. The TemplateEditor module is a good supporting tool for both the "quiet" as well as the manual mode of the CT tool. The purpose of this module is to give the users a tool for the definition of special rules for the selection of appropriate concepts. Linking a document (or its text fragment) to a template means that association links from this document (or text fragment) to all concepts resulting from the application of the particular template to the present ontology will be created automatically. In general, all documents have (possibly empty) headers, which represent various properties of the documents. For example, they can contain information regarding the document name, date and time of its creation, authors of documents, comments, etc. The set of attributes applicable for a header definition is application-dependent. Some properties may be compulsory while others are optional. Within a template three basic kinds of concept references are available.
• A concept can be directly referenced (only the target concept must be given).
• A concept can be the result of an "if - then" rule applied to values of document properties.
• A concept can be referenced by a document property. In this case, the name of the concept is determined by the value of a document property.
The first two kinds of references are especially useful for manual linking, but can be used for automatic linking as well. The last type is well suited for the purposes of the quiet mode of the CT tool, i.e. for automatic linking of whole documents. In this particular case, the target concepts are given indirectly and depend on the document property values as given in the document header. Moreover, means for the automatic creation of instances in the domain knowledge model (in quiet mode) are provided as well. This was another user-defined requirement.
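A hedged sketch of the three template reference kinds (ours; property names, concept names and the rule format are hypothetical, and a real template would be built in TemplateEditor rather than written by hand):

```python
# A document header is a dict of property values; each template reference yields concept names.

def apply_template(template, header):
    concepts = set()
    for ref in template:
        if ref["kind"] == "direct":                       # 1. directly referenced concept
            concepts.add(ref["concept"])
        elif ref["kind"] == "if-then":                    # 2. rule over document properties
            if header.get(ref["property"]) == ref["value"]:
                concepts.add(ref["concept"])
        elif ref["kind"] == "property":                   # 3. concept named by a property value
            if ref["property"] in header:
                concepts.add(header[ref["property"]])
    return concepts

daily_report_template = [
    {"kind": "direct",   "concept": "Daily Report"},
    {"kind": "if-then",  "property": "department", "value": "Sales", "concept": "Campaign"},
    {"kind": "property", "property": "shop"},
]
header = {"author": "J. Novak", "department": "Sales", "shop": "Shop Kosice 1"}
print(apply_template(daily_report_template, header))
# {'Daily Report', 'Campaign', 'Shop Kosice 1'}  -> association links are then created automatically
```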
4 Agent-based support for distributed organisational memory

As already mentioned in previous sections, one of the most important requirements for the success of an OM is support for the distributed environment of an organisation. The OM should be flexible enough to fit different organisational network settings and it should also remain open to external sources such as the Internet.

Figure 3. Distributed architecture of the KnowWeb environment - intranet solution.

The state-of-the-art solution to the first problem lies in the utilisation of distributed objects. The most popular architectures for supporting distributed objects are CORBA (Common Object Request Broker Architecture) and DCOM (Distributed Component Object Model) (Pritchard 1999). Objects in these architectures can capture the high-level logic (so-called business logic) of a distributed application and are accessible to processes outside the computer running them. Present implementations of these standards are based on commonly used communication protocols (e.g. TCP/IP).

In the research prototype version of the KnowWeb toolkit a 3-tiered architecture was designed and implemented. We distinguish the following tiers:

• the data source tier, represented by the relational database,
• the middle tier (called 'Server Front-End'), which offers services to clients independently of the database implementation, and
• the KnowWeb clients.

The relations among these three tiers are depicted in Figure 3. The document store and the OM functionality are mapped to the KnowWeb server. The server offers key services - (i) acquisition and modification of the domain knowledge model state, (ii) association links of the documents, and (iii) retrieval in the document or ontology (domain model) space. At the implementation level all these operations are carried out as operations upon a relational database.

The solution described above fulfils the requirement of flexibility in corporate intranet settings and partially addresses the openness to external sources. A KnowWeb server is usually situated on its own local area network, contains (mostly one) domain knowledge model and typically serves many clients. A KnowWeb client can communicate with more than one KnowWeb server, potentially outside the intranet scope.
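As a rough illustration of the middle tier, the sketch below shows how the three key services of the 'Server Front-End' could be exposed to clients. The class, table and method names are invented for this illustration and are not taken from the paper.

    # Hypothetical sketch of the middle tier ('Server Front-End'): it hides the
    # relational database behind the three key services named in the text.
    import sqlite3
    from typing import List

    class ServerFrontEnd:
        def __init__(self, db_path: str = ":memory:"):
            self.db = sqlite3.connect(db_path)
            self.db.execute("CREATE TABLE IF NOT EXISTS concepts (name TEXT PRIMARY KEY)")
            self.db.execute("CREATE TABLE IF NOT EXISTS links (doc TEXT, concept TEXT)")

        # (i) acquisition and modification of the domain knowledge model state
        def add_concept(self, name: str) -> None:
            self.db.execute("INSERT OR IGNORE INTO concepts (name) VALUES (?)", (name,))

        # (ii) association links of the documents
        def add_association_link(self, document_id: str, concept: str) -> None:
            self.db.execute("INSERT INTO links (doc, concept) VALUES (?, ?)",
                            (document_id, concept))

        # (iii) retrieval in the document (or domain model) space
        def retrieve_documents(self, concepts: List[str]) -> List[str]:
            placeholders = ",".join("?" * len(concepts))
            rows = self.db.execute(
                f"SELECT DISTINCT doc FROM links WHERE concept IN ({placeholders})",
                concepts).fetchall()
            return [row[0] for row in rows]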
However, accessing KnowWeb servers outside fast intranet networks assumes a reliable network connection with guaranteed throughput, especially when users want to introduce new documents into the OM and browse documents already introduced. Unfortunately, this is not the case with most portable computers (notebooks etc.) or organisation branches connected through a low-speed dial-up network connection. To solve this major problem with throughput, a mobile agent-based solution has been proposed and implemented.

Mobile agent (MA) technology is perhaps the most promising paradigm that supports application design for a dynamically changing, networked environment with distributed information and computation resources. The most significant properties of mobile agents regarding this role are autonomy and mobility (Rothermel et al. 1997). MAs are autonomous because of their capability to decide which locations in a computer network they will visit and which actions they will take in these locations. This ability is embodied in the source code of every agent (implicit behaviour) or in the agent's itinerary, which can be dynamically modified (explicit 'orders'). Mobile agents can move between different locations in a network. A location is the basic environment for the execution of mobile agents' code and is therefore an abstraction of the underlying computer network and operating system.

The usual benefits of mobile agents are (i) reduced network load, (ii) overcoming network latency, (iii) encapsulation of different protocols, (iv) asynchronous and autonomous execution, and (v) natural heterogeneity (Harrison et al. 1995; Lange & Oshima 1999). We claim that robust and scalable OM systems can profit from these features; this will be especially visible in companies with a world-wide and/or distributed structure. The modified distributed structure of the KnowWeb system is depicted in Figure 4.

Figure 4. Mobile-agent based distributed structure of the KnowWeb system - supports also less reliable and low-speed WANs.

Dedicated mobile agents can carry out the most critical operations, such as retrieval and gathering of documents at non-intranet servers. In a retrieval operation the client can formulate the query locally and either send an appropriate search agent directly or ask a local KnowWeb server to send such an agent. The situation when gathering an already introduced document (accessible through another KnowWeb server) is similar. The main advantage of this functionality is the possibility for a client to go off-line for the period of the operation's execution; at the next re-connect the results of the ordered operations are presented to the user. Such an approach brings significant savings in the communication between distributed KnowWeb servers containing a distributed OM.

To enable work with mobile agents, the Mobile Agent Environment (MAE) must be available on each concerned computer (client or server). The MAE offers the following functionality: (i) creation of a mobile agent with a unique identity, (ii) transport of an agent, (iii) sending a message to an agent (possibly on another host), and (iv) getting status information about any agent. The MAE used in the research prototype of the KnowWeb toolkit is described in the following section.
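The four MAE services listed above can be pictured as a small interface. The following is a schematic Python sketch with invented names; it is only an analogy and does not reproduce the actual MAE, which is built on Mozart/Oz.

    # Schematic sketch of a Mobile Agent Environment interface with the four
    # services named in the text; all names here are illustrative only.
    import uuid
    from typing import Any, Dict

    class MobileAgentEnvironment:
        def __init__(self, host: str):
            self.host = host
            self.agents: Dict[str, object] = {}     # agents currently hosted here

        # (i) creation of a mobile agent with a unique identity
        def create_agent(self, agent: object) -> str:
            agent_id = str(uuid.uuid4())
            self.agents[agent_id] = agent
            return agent_id

        # (ii) transport of an agent to another MAE
        def transport(self, agent_id: str, target: "MobileAgentEnvironment") -> None:
            agent = self.agents.pop(agent_id)
            target.agents[agent_id] = agent          # becomes a 'foreign' agent there

        # (iii) sending a message to an agent (possibly on another host)
        def send_message(self, agent_id: str, message: Any) -> None:
            self.agents[agent_id].receive(message)

        # (iv) getting status information about any agent
        def status(self, agent_id: str) -> str:
            return "local" if agent_id in self.agents else "unknown/remote"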
5 Experimental framework for mobile agents

The basic functions of mobile agent environments (represented in today's mobile agent systems by agent servers) are identified by the Mobile Agent System Interoperability Facility (MASIF) (Milojcic et al. 1998) and include: (i) transferring an agent, which can include initiating an agent transfer, receiving an agent, and transferring classes, (ii) creating an agent, (iii) providing globally unique agent names, (iv) supporting the concept of a region, (v) finding a mobile agent, and (vi) ensuring a secure environment for agent operation.

An MA-based application usually has to be programmed in two separate parts: the mobile agent and the location, known as a context in Aglets (Oshima et al. 1998), as location services in Ara (Peine & Stolpmann 1997), as a place in Gypsy (Jazayeri & Lugmayr 2000), or as service bridges in Concordia (Wong et al. 1997). A mobile agent has a predetermined interface and restricted resources to communicate with on each visited location.

The mobile agent based distributed architecture of the KnowWeb system uses the ESMA toolkit, developed and implemented at the Department of Computers and Informatics of the TU Košice (Paralič M. 2000). ESMA - the experimental system for support of mobile agents - combines the power of high-level distributed programming with the mobile agent paradigm. The Mozart system (The Mozart Programming System) was chosen as the implementation platform. It implements the Distributed Oz (DOz) programming language and offers simultaneously the advantages of a true distributed system and the means for building a mobile code system (Picco 1998). In the following subsections, the mobile agent environment built from servers and its services, and the methodology of how to build an agent-based application in our framework, are shortly introduced.

5.1 Mobile agent environment

The basic functions offered by a mobile agent environment are transport and management of agents. In today's agent systems these services are offered by servers, which must be installed and running on every host computer that should be available to the mobile agents. Our experimental framework in DOz offers the basic functionality mentioned above in a similar way. In the current implementation of the system, the MAE can be started only once per host computer. Every agent created on a local MAE is a home agent for this MAE; on all other sites it visits, it gains the status of a foreign agent. The MAE stores information about all its home agents and the currently available foreign agents in a local database. The information about a foreign agent is kept in the system only during the time between successfully receiving the agent from the previous host and successfully sending it to the next host.

For the programmer of a mobile agent based application, the mobile agent environment is represented by the MAE module, which must be imported. Importing the MAE functor causes either the launching of a new environment, with initialization values for home and foreign agents and already connected MAE servers taken from a persistent database, or getting a reference to an already started local MAE server. This process is triggered not only by launching a new local mobile-agent based application but also by resuming every incoming mobile agent. Thus, an agent gets access to the key services of the mobile agent environment. The possibility of dynamic loading and installation of first-class modules (Duchier et al. 1998) is very important here.

The communication between MAEs is realized in two layers: the first layer uses TCP sockets for exchanging tickets for Oz ports. Oz ports then build a second, high-level communication layer, which can take advantage of Oz space data transparency. The Oz space offers the possibility of transparently copying stateless entities and of creating references for world-wide unique stateful entities. These possibilities can be fully utilized especially in inter-agent communication.
5.2 Mobile agent based applications

Creating an MA-based application in the proposed framework is straightforward and requires only the following steps:

1. Identify all fixed, non-transferable resources needed by the application (i.e. their type) by means of abstract names, and identify the parts of the transferable agent state.
2. Design and implement application-specific classes that are derived from the MobileAgent class and deal with the agent state and other resources through their abstract names.
3. Design and implement an application that creates one or more instances of mobile agents, specifies their itinerary, sends them away, waits until they finish their jobs (or until the owner stops their work), and processes the results.
4. Design and install special environment modules (functors) in compiled form. They map the abstract resources of an MA to the real local resources of the host computer and thereby enable the execution of the MA based application.

5.3 KnowWeb and ESMA

To make the distributed structure of the KnowWeb system more flexible and suitable not only for fast intranet networks, two special agent classes were created. The first one is KW_SearchAgent and the other one is KW_FetchAgent (a schematic sketch is given below). Both mobile agent classes were proposed and implemented according to the methodology in Section 5.2. The main task of the KW_SearchAgent is to get a query from the KnowWeb client and to walk, according to its travel plan, over the KW servers in order to obtain the list of relevant documents - i.e. their exact addresses. Based on this list a KW client can send one or more agents of the class KW_FetchAgent, which gather the whole document and return with it. At the client side, mobile agents are created automatically after a query or a fetching requirement is stored in a special form in a predefined directory of the local file system. At the server side, the KW_FetchAgent communicates directly with the WWW server that maintains the document space in the KnowWeb system. KW_SearchAgents store their queries in a predefined directory of the local file system and wait for the answers, which they can also read from the local file system.
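The following Python sketch is only an analogy (ESMA itself is implemented in Distributed Oz) illustrating the methodology of Section 5.2: an application-specific agent class derived from a generic MobileAgent base class, an itinerary, and abstract resource names resolved at each visited host. All class and resource names are invented for this illustration.

    # Illustrative analogy of Section 5.2 (not the ESMA/Oz implementation):
    # a search agent visits the servers on its itinerary and collects the
    # addresses of documents matching a query.
    from typing import Dict, List

    class MobileAgent:
        """Generic base class: carries a transferable state and an itinerary."""
        def __init__(self, itinerary: List[str]):
            self.itinerary = itinerary          # hosts still to visit (explicit 'orders')
            self.state: Dict[str, object] = {}  # transferable agent state

        def run_at(self, host: str, resources: Dict[str, object]) -> None:
            raise NotImplementedError           # implicit behaviour of a concrete agent

    class SearchAgent(MobileAgent):             # step 2: application-specific class
        def __init__(self, itinerary: List[str], query: str):
            super().__init__(itinerary)
            self.state["query"] = query
            self.state["hits"] = []

        def run_at(self, host: str, resources: Dict[str, object]) -> None:
            # 'document_index' is an abstract resource name, mapped by each host (step 4)
            index = resources["document_index"]
            self.state["hits"] += index.search(self.state["query"])

    # step 3: the owning application creates the agent, dispatches it along the
    # itinerary, waits for its return, and finally processes state["hits"].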
6 Applications

6.1 KnowWeb pilot applications in BRD

Botnia Retail Data Inc. (BRD), located in Finland, has developed two of the five existing pilot applications based on the KnowWeb system. The main business activity of the BRD group is software development and consultancy for retail and industry. Their main product is WINPOS®, a Point-of-Sale (POS) software package. BRD felt that the best way to understand how to use the KnowWeb system was to first use the software within their own organisation with a domain knowledge model suitable for them. The second phase was the simulation of a retail environment with a domain knowledge model suitable for retail chains. The database used was an SQL Server database running under Windows NT, to which documents are stored when they are finalised. Each user belonging to the personnel of BRD uses a Windows NT workstation on which a KnowWeb client is operating. The documents entered by a user are typically linked to a group of contexts. In order to quickly find the right contexts to which the documents should be linked, they have created a number of individual concept conditions.

They handle mainly documents of the MS Word or HTML types. They often scan in news articles and other printed material, which is first pasted into an MS Word document and supplemented with explanations before the document is stored into the KnowWeb space. The advantages identified in BRD after a couple of months of active KnowWeb use are the following.

1. Improved internal efficiency. They have been able to organise themselves much better than before. Now anyone in BRD can access relevant documents at any time without having to hunt for a document and waste the time of his/her colleagues.
2. Faster customer response. Support personnel are now able to check on-line all information related to various WINPOS® versions and features.
3. Exact and broader feedback to development. They are tracing competitor information as well as feedback and feature requests from the end-users and dealers of their WINPOS® software product. Previously they did not trace this information systematically. With the new system the development engineers receive much better information as a basis for decisions about further enhancements to the product range.
4. Improved marketing. Through the systematic entering of competitors' information as well as their own marketing material, combined with a better feeling for which features customers are actually asking for, they feel the system has helped them improve their marketing activities.

On the other hand, the following disadvantages have been reported:

1. The user interface of the KnowWeb system prototype version seems to be too complicated for ordinary users.
2. There is a time overhead in inserting the documents into the KnowWeb system. During a busy day this is simply not done.
3. The success of the introduction of the KnowWeb system depends to quite a large extent on the discipline of the staff using it.

The main motivation of BRD's second pilot application was to structure all the events within a retail chain that cause disturbances to the present information system. In modern network environments these exceptions are most often reported in the form of e-mails and documents sent between retail head office departments and the shops, today often without structure and forgotten after some time, thus leaving incorrect information in various databases and reports of present POS information systems. When planning a campaign, or when making judgments e.g. regarding the profitability of certain article groups, the marketing departments often rely on historical information, which may go several years back. If the figures are unreliable, the decisions taken may be incorrect; the problem is that many of the systems on the market today are not able to trace disturbances that have happened in the past.

Having this in mind, BRD have linked their WINPOS® Point-of-Sale software to the KnowWeb system within the second pilot application. In this pilot application the end-of-day routines in the WINPOS® Point-of-Sale package (normally run at the end of the day when the shop has closed) automatically generate shop-specific HTML documents, which are automatically introduced into the KnowWeb space. This means that the HTML reports are automatically inserted into the KnowWeb database and automatically linked to some predefined concepts via templates. Moreover, certain date-related information leads to dynamic generation of new instances in the domain model.
As examples of documents stored automatically we can mention the shop report, i.e. sales, profit and payment media information summarised for a shop. Another example is the so-called department report, which contains sales amounts, quantities and profit for the main article groups of the company. The advantages the customer will have from this system are:

1. Improved internal efficiency.
2. Faster customer response.
3. Improved marketing.
4. Sales and profit analysis related to disturbances. Automatically created and linked POS reports are combined within the KnowWeb system with information about events causing various disturbances in the retail chain (see above), providing a realistic view of the calculated numbers in the context of the events that occurred.

6.2 Webocrat system

Another very interesting application domain is electronic democracy (Paralič & Sabol 2001). The architecture of a Web based system, Webocrat¹, is being designed with the aim to empower citizens with an innovative communication, access and voting system supporting increased participation of citizens in democratic processes and increasing the transparency and accessibility of public administration (PA). The basic scheme of the proposed WEBOCRAT system is depicted in Figure 5. The system is based on knowledge modelling technology. Ontological knowledge models are employed in order to index all the information present in the system. Therefore the first layer - the knowledge model - is depicted in the core of the scheme in Figure 5. From the technical point of view, the system will be based on results achieved within the KnowWeb project, focused on organising information using domain knowledge models. Their employment enables precise annotation of information based on its content, which results in an efficient and powerful information retrieval capability.

¹ EC funded project IST-1999-20364 Webocracy (Web Technologies Supporting Direct Participation in Democratic Processes)

Figure 5. The principal scheme of the Webocrat system.

The second layer is represented by the publishing, discussion and opinion polling spaces, providing means for storing, updating and managing (in principle three slightly different types of) documents and their relations among themselves, as well as their relations to the knowledge model in the WEBOCRAT system. The Discussion Forum (DF) module will support intelligent communication processes between public authorities, citizens and their elected representatives. DF will be responsible for documents in the discussion space. The Web Content Management (WCM) module will support publication of documents on the Internet, i.e. it will be responsible for documents in the publishing space. The Opinion Polling Room (OPR) will enable electronic opinion polling, featuring also support for authentication and voters' privacy. OPR will be responsible for documents in the opinion polling space.

The third layer is composed of two retrieval-focused modules supporting retrieval capabilities that need only read access to the three above-mentioned spaces as well as to the knowledge model. The Information Desk (ID) will retrieve relevant documents of different types stored in the system. The Reporter/Summary (REP) module will provide means for the calculation of a variety of statistics as well as some more sophisticated approaches based on, e.g., data mining techniques. Moreover, this module will also provide alerting functionality, which has been required by user partners.
A user registered in the system as an individual entity (i.e. not an anonymous user) is provided with a personal access page ensuring him/her individual access to the system. This page is built automatically and can contain several parts. Some of them are general and the others are person-specific. The former can serve as a starting point for browsing all published documents accessible to the user, all conferences he/she is allowed to participate in and all running polls for which he/she is eligible, for using the search facilities of the system, for reading hot information, etc. The latter parts are devoted to the user's personal newsletter and to links to documents and conferences whose topics match the user's areas of interest. The personal access page hides the division of the system into modules, so that the terms 'publishing space', 'discussion space', and 'opinion polling space' do not confuse users. The personal access page enables the user to access all functionality of the system that he/she is allowed to access in a uniform and coherent way.

7 Acknowledgements

This work is done within the Webocracy project, which is supported by the European Commission DG INFSO under the IST programme, contract No. IST-1999-20364, and project No. 1/8131/01 "Knowledge technologies for information acquisition and retrieval" supported by the Grant Agency of the Ministry of Education and Academy of Science of the Slovak Republic. The content of this publication is the sole responsibility of the authors, and in no way represents the view of the European Commission or its services.

8 References

[1] Abecker A., Bernardi A., Hinkelmann K., Kühn O. & Sintek M. (1998) Toward a Technology for Organizational Memories. IEEE Intelligent Systems, 13, May/June, p. 40-48
[2] Borghoff U. M. & Pareschi R. (Eds.) (1998) Information Technology for Knowledge Management. Springer Verlag
[3] Chandrasekaran B., Josephson J.R. & Benjamins V.R. (1999) What Are Ontologies and Why Do We Need Them. IEEE Intelligent Systems, 14, p. 20-26
[4] Duchier D., Kornstaedt L., Schulte Ch. & Smolka G. (1998) A Higher-order Module Discipline with Separate Compilation, Dynamic Linking, and Pickling. Technical Report, Programming Systems Lab, DFKI and Universität des Saarlandes
[5] Gruber T.R. (1993) A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5, 2
[6] Harrison C.G., Chess D.M. & Kershenbaum A. (1995) Mobile Agents: Are They a Good Idea? Research Report, IBM Research Division
[7] Jazayeri M. & Lugmayr W. (2000) Gypsy: A Component-based Mobile Agent System. Proceedings of the 8th Euromicro PDP'2000, Rhodos, Greece
[8] Lange D.B. & Oshima M. (1999) Seven Good Reasons for Mobile Agents. Communications of the ACM, 42, 3, p. 88-89
[9] Mach M., Sabol T., Paralič J. & Kende R. (2000) Knowledge Modelling in Support of Knowledge Management. Proceedings of the CRIS'2000 Conference, Espoo-Helsinki, Finland, p. 84-88
[10] Milojcic D., Breugst M., Busse I., Campbell J., Covaci S., Friedman B., Kosaka K., Lange D., Ono K., Oshima M., Tham C., Virdhagriswaran S. & White J. (1998) MASIF: The OMG Mobile Agent System Interoperability Facility. Proc. of the Second International Workshop, Mobile Agents '98, Springer-Verlag, p. 50-67
[11] Motta E. & Zdrahal Z. (1998) A Principled Approach to the Construction of a Task-specific Library of Problem Solving Components. Proceedings of the 11th Banff Knowledge Acquisition for KBS Workshop, Canada
[12] Newell A. (1982) The Knowledge Level. Artificial Intelligence, 18, p. 87-127
[13] Nonaka I. & Takeuchi H.
(1995) The Knowledge Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford Univ. Press
[14] Oshima M., Karjoth G. & Ono K. (1998) Aglets Specification (1.1). IBM Corporation
[15] Paralič M. (2000) Mobile Agents Based on Concurrent Constraint Programming. Lecture Notes in Computer Science, Vol. 1897, pp. 62-75, Springer Verlag
[16] Paralič J. & Sabol T. (2001) Implementation of e-Government Using Knowledge-Based System. Paper accepted for the 2nd Int. DEXA Workshop "On the Way to Electronic Government", Munich, September 2001
[17] Peine H. & Stolpmann T. (1997) The Architecture of the Ara Platform for Mobile Agents. Proc. of the First International Workshop on Mobile Agents, MA'97, LNCS 1219, Springer Verlag
[18] Picco G.P. (1998) Understanding, Evaluating, Formalizing, and Exploiting Code Mobility. PhD thesis, Dipartimento di Automatica e Informatica, Politecnico di Torino, Italy
[19] Pritchard J. (1999) COM and CORBA Side by Side: Architectures, Strategies, and Implementations. Addison-Wesley
[20] Rothermel K., Hohl F. & Radouniklis N. (1997) Mobile Agent Systems: What is Missing? Proc. of the International Working Conference on Distributed Applications and Interoperable Systems DAIS'97
[21] The Mozart Programming System. Available at http://www.mozart-oz.org/
[22] Tiwana A. (2000) The Knowledge Management Toolkit. Prentice Hall
[23] van Heijst G., Schreiber A.T. & Wielinga B.J. (1997) Using Explicit Ontologies in KBS Development. Int. Journal of Human-Computer Studies, 46, p. 183-292
[24] Wong D., Paciorek N., Walsh T., DiCelie J., Young M. & Peet B. (1997) Concordia: An Infrastructure for Collaborating Mobile Agents. Proc. of the First International Workshop on Mobile Agents, MA'97, LNCS 1219, Springer-Verlag

An efficient approach to extracting and ranking the top K interesting target ranks from Web search engines

Chien-I Lee
Institute of Information Education, National Tainan Teachers College, Tainan, Taiwan, ROC
leeci@ipx.ntntc.edu.tw

Cheng-Jung Tsai
Dept. of Computer & Information Science, National Chiao Tung University, Hsinchu, Taiwan, ROC
tasicj.cis89g@nctu.edu.tw

Keywords: search engine, term weighting function, similarity function, ranking, Boolean expression

Received: April 2, 2001

Recently, several search engines have been established to help people find interesting information among the rapidly increasing number of web pages on the Internet. To obtain useful and reasonable search results, users may submit queries with more than one query term combined by a Boolean expression, which is supported by all existing search engines. However, these search engines all put the same emphasis on each query term combined by the Boolean expression. That is, for identical queries, different users obtain the same search results. This contradicts the fact that different users usually have different search interests even with the same queries. In other words, a useful search engine should nowadays allow users to emphasize each query term unequally in order to get more reasonable and individual search results. In this paper we propose an efficient approach, named the Extreme Score Analysis method (ESA method), to solve this problem. The ESA method uses web pages' original scores to derive a user's top K web pages when each query term is assigned a different weight.
Moreover, we improve the ESA method and further propose the Extreme Score Inverse Analysis method (ESIA method), which can efficiently find users' top K interesting target ranks when these users assign different weights to each query term.

1 Introduction

With the fast progress of computer and network technologies, computers connected to the Internet will soon become indispensable electrical appliances of our daily life. Through the Internet, people can get the information they need much more easily and quickly. Meanwhile, the rapidly growing amount of data carried over the Internet makes it very difficult for people to search for and filter the appropriate information they want. In recent years, several search engines [Brin & Page 1999, Buttler et al. 2001, Ikeji & Fotouhi 1999, Kruger et al. 2000, Lee & Yuwono 1996a, Lee & Yuwono 1996b, Marchiori 1997, Mauldin 1997, Schwartz 1998, Sheldon et al. 1996] have been proposed to reduce this overload.

In general, in order to support full-text information retrieval, a search engine filters terms in the contents of web pages to establish a corresponding inverted index [McGill & Salton 1983, Yu et al. 1999]. When users submit a query with some proper terms (these terms are called query terms in the rest of this paper) as keywords, the search engine looks these query terms up through the inverted index to find the corresponding web pages and ranks these web pages by its predetermined term weighting function and vector similarity function. Finally, the search engine returns all related web pages in sequence. Based on the ranking results, users can read the returned web pages efficiently and get more relevant information.

Usually, to enlarge or narrow the search, users may submit a query with more than one query term and use a Boolean expression to combine these query terms. Such a query is called a combined query. Although the search engines nowadays all support the Boolean expression, they put the same emphasis on all query terms combined by the Boolean expression. For example, when a user submits a combined query with two query terms, "job and salary", a typical search engine will return the web pages that contain both terms, and all returned web pages are ranked according to their own scores (we call such a rank of each web page its original rank). Each web page's score could be the sum of the individual contributed values of the two query terms in its own context. (Note that the contributed value of a query term denotes the value returned by the search engine's term weighting function.) However, users may put more emphasis on one query term than on the other. For instance, an unemployed man may want to look for a job through web pages containing "job" information and "salary" information as well. In this case the user might put more emphasis on "job" than on "salary". That is, for identical queries, different users could obtain different search results because they usually have different search interests. To quantify such a case, users should be allowed to assign different weights to each query term in the Boolean expression, e.g. "(0.7) job and (0.3) salary", which means the degrees of importance of the two query terms are in a ratio of 7:3. To fulfil such a requirement for different weights of query terms, it is necessary to re-calculate the score of each web page according to the newly given weights and then re-rank the pages (in this paper, the new rank of each web page is called its target rank).
Finally, users can get the real ranking results. However, a typical search engine only returns the matched web pages with final scores, ranks, etc., but not the individual contributed value of each query term. To re-calculate the scores, two problems need to be resolved. First, we must know the scoring function of the search engine. Usually, because it is a commercial secret, a search engine's scoring function is unavailable; we can then work around this problem by adopting another scoring function, pre-defined by our system. The second problem is that we have to scan every returned web page's content to get each query term's contributed value. Such a re-scanning task is too time-consuming to be practicable.

In a real situation, most users usually only view the top K web pages from all HN returned web pages [Bergman et al. 2000, Chaudhuri & Gravano 1999] (K << HN, where HN is the total number of returned web pages) for the sake of time or for lack of patience. Therefore, the second problem mentioned above can be resolved efficiently by focusing on the top K issue. Based on the top K issue, to avoid the overhead of re-calculating all returned web pages, we propose an efficient Extreme Score Analysis method (ESA method). The ESA method, without re-scanning all returned web pages to get each query term's contributed value for each page, retrieves the web pages' original scores and applies numerical analysis to find the users' top K interesting target ranks. It informs users that their top K interesting target ranks will be among the top R original ranks (K < R << HN). In the performance study in Section 4 we find that the value of R is close to that of K. For example, if a user submits a combined query with two query terms and requests the top K web pages according to his/her weights (assume 0.6 and 0.4, respectively), R is approximately equal to 1.7K. If the cost of re-scanning the top R web pages is acceptable to the user, our approach can provide the user with his/her real top K target ranks.

The rest of the paper is organized as follows. In Section 2, we survey a typical search engine's components and discuss the fundamentals of the term weighting function and the vector similarity function. Section 3 proposes the basic idea of the Extreme Score Analysis method (ESA method) and a system prototype with this method. Section 4 presents an experimental evaluation of the ESA method. In Section 5, we further improve the efficiency of the ESA method with the ESIA method (Extreme Score Inverse Analysis method). The mathematical analysis is presented in Section 6. Finally, Section 7 is the conclusion.

2 The components of a search engine

When a user attempts to locate relevant information within a corpus on the Web, a web search engine system is the most common choice. A typical search engine system, composed of four components, is shown in Figure 1.

Figure 1. A typical search engine.

a.) User interface
The user interface of a web search engine is an HTML (hypertext markup language) form, which can be invoked by standard WWW client programs such as Internet Explorer and Netscape, and manages the interaction between a user and a search engine. It accepts a user's query, dispatches the query to the search engine, and finally displays the search results to the user.
b.) Indexer Robot
The indexer robot [Cho et al. 1998, Introna & Nissenbaum 2000] is an autonomous WWW browser which communicates with WWW servers using HTTP (Hypertext Transfer Protocol). It visits a given WWW site, traverses hyperlinks in a breadth-first manner, retrieves the web pages, extracts terms and hyperlink data from the pages, and inserts these extracted data into an indexer database. A list of target sites is given to the indexer robot for creating the index initially.

c.) Indexer database
The indexer robot reads web pages and sends them to the indexer database (also named the inverted index) to create indexing records. If a web page changes, the inverted index should also be updated. The update is not made in the inverted index until the indexer robot has revisited this web page. To reduce storage overhead, each web page in the inverted index is represented by a set of proper terms.

d.) Search engine
This part is mainly composed of two important mechanisms:

I.) Term weighting mechanism
This mechanism adopts a term weighting function to assign a specific weight to each term of each web page. This weight indicates the importance of the term in representing this web page. As a result, a web page in the indexer database can be regarded as a vector p = (p1, ..., pm), where pi is the weight of the ith term in representing the web page. A well-known term weighting function is named tf×idf [Kantrowitz et al. 2000, McGill & Salton 1983]. It assumes that term importance is proportional to the standard occurrence frequency of each term tj in each web page Hi (that is, FREQij) and inversely proportional to the total number of web pages in the collection to which the term is assigned (that is, HOCFREQj). A general form of the term weighting function tf×idf can then be

$$WEIGHT_{ij} = FREQ_{ij} \times \left[\log_2(n) - \log_2(HOCFREQ_j) + 1\right],$$

where n is the total number of web pages in the collection. There are many other proposed term weighting functions, such as the Signal-Noise Ratio and the Term Discrimination Value [McGill & Salton 1983, Ricardo & Berthier 1999]. Although every term weighting function has its own properties, all of them adopt the same hypothesis as tf×idf, namely that term importance is proportional to the standard occurrence frequency of each term in each web page. Nevertheless, tf×idf is superior to the others in terms of efficiency and cost [McGill & Salton 1983, Ricardo & Berthier 1999]. Furthermore, since web pages are written in the hypertext markup language (HTML), each search engine may take some related factors (e.g. hyperlinks, the number of times the keywords occur in the document title, etc.) into account to provide more suitable ranking results [Aguiar & Beigbeder 2000, Birmingham et al. 1999]. Consequently, most term weighting functions applied by existing search engines are derived from tf×idf.
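A small Python sketch of this kind of weighting, using the tf×idf form quoted above (the function and variable names are ours), might look as follows:

    # Sketch of the tf x idf weighting form quoted above:
    # WEIGHT_ij = FREQ_ij * (log2(n) - log2(HOCFREQ_j) + 1)
    import math
    from collections import Counter
    from typing import Dict, List

    def term_weights(pages: List[List[str]]) -> List[Dict[str, float]]:
        n = len(pages)                                    # total number of web pages
        doc_freq = Counter()                              # HOCFREQ_j: pages containing term j
        for terms in pages:
            doc_freq.update(set(terms))
        weights = []
        for terms in pages:
            freq = Counter(terms)                         # FREQ_ij: occurrences in this page
            weights.append({t: f * (math.log2(n) - math.log2(doc_freq[t]) + 1)
                            for t, f in freq.items()})
        return weights

    # e.g. term_weights([["job", "salary", "job"], ["salary", "tax"]])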
II.) Similarity mechanism
In order to rank all related web pages, a search engine assigns relevance scores to them [Marchiori 1997]. The scores indicate the similarities between a given query and these web pages, and this ranking task is implemented by a vector similarity function. Although there are many kinds of vector similarity functions, they all exhibit one common property, namely that the similarity value increases when the weight of the common properties in the two vectors increases [Ricardo & Berthier 1999]. The dot-product function, the basis of all vector similarity functions, has been widely used for the evaluation of retrieval functions [Fan et al. 2000, Liu et al. 1999]. Even the well-known Cosine function is simply the dot-product function with the weights of the terms computed in a specific manner. With the dot-product function, the similarity between a query (also represented as a vector q = (q1, ..., qm), where qi is the weight of the ith term ti) and a web page can be represented as

$$Similarity(q, p) = \sum_{i=1}^{m} p_i \, q_i,$$

where pi is the weight of the ith term in representing a web page, qi is the weight of the ith term in representing a query, and m is the number of terms representing a web page.

3 Extreme Score Analysis method and the system prototype

3.1 Extreme Score Analysis method

As discussed in Section 2, we can assume that when a user submits a combined query with n query terms, the score KSi (in this paper, we call it the original score) that a search engine assigns to each web page Hi (i denotes the original rank of this web page among the returned matched web pages) will be

$$KS_i = \sum_{j=1}^{n} W_j \left( b \, \frac{FREQ_{ij} + a}{HOCFREQ_j} + c \right),$$

where 1 <= i <= HN and HN is the number of all returned web pages. In the above equation, Wj is the weight of term j given in the user's combined query and b((FREQij + a) / HOCFREQj) + c is the contributed value of term j in the original score KSi. The variables a, b, and c are used by each search engine to improve tf×idf under consideration of the web page's properties, as mentioned in Section 2. Nevertheless, our research is concerned only with the ranking result of the returned web pages; for the same search engine, whatever the values of the variables a, b, and c are, the ranking result will never change. Furthermore, since all current search engines put the same emphasis on every query term (that is, Wj = 1/n), the original score can be simplified as

$$KS_i = \sum_{j=1}^{n} \frac{1}{n} \, ConV_{ij},$$

where 1 <= i <= HN, HN is the number of all returned web pages, and ConVij = b((FREQij + a) / HOCFREQj) + c. Obviously, if users assign each query term's weight unequally, each web page's original score may change. We call the score that is recomputed from the original score the target score,

$$KT_i = \sum_{j=1}^{n} W_j \, ConV_{ij}, \qquad \text{where } \exists j,\; W_j \ne 1/n.$$

For example, suppose that a user submits a combined query with two query terms, term p and term q, and the user wants to assign 0.7 and 0.3 to the query terms' weights respectively. Then the original score of the ith matched web page is KSi = 0.5 ConVpi + 0.5 ConVqi, and the target score of the ith matched web page will be KTi = 0.7 ConVpi + 0.3 ConVqi.
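To keep the notation concrete, the following small Python sketch computes the two scores from the contributed values ConVij of one web page; the helper functions are our own and assume the contributed values are already known.

    # Sketch: original score KS_i (equal emphasis) and target score KT_i (user weights),
    # computed from the contributed values ConV_ij of a single web page.
    from typing import List

    def original_score(conv: List[float]) -> float:
        """KS_i = sum_j (1/n) * ConV_ij"""
        n = len(conv)
        return sum(v / n for v in conv)

    def target_score(conv: List[float], weights: List[float]) -> float:
        """KT_i = sum_j W_j * ConV_ij, with the W_j summing to 1"""
        return sum(w * v for w, v in zip(weights, conv))

    # e.g. for ConV = [0.9, 0.4]: original_score([0.9, 0.4]) -> 0.65,
    #      target_score([0.9, 0.4], [0.7, 0.3]) -> 0.75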
First, to conveniently explain our method, we consider the case with two query terms, term p and term q. Without loss of generality, we divide ConVp and ConVq by their respective maxima to limit their values to between 0 and 1, i.e. 0 <= ConVp, ConVq <= 1. Two lemmas and three theorems, which are applied in our ESA method, need to be stated and proved as follows.

Lemma 1. Let the x-axis and y-axis denote ConVp and ConVq respectively in the coordinate plane. Then the linear equation which passes through the origin and is orthogonal to the straight line SC = Wp × ConVp + Wq × ConVq is ConVq = (Wq / Wp) × ConVp, where 0 < Wp, Wq < 1.

Proof. According to the slope theorem, the product of the slopes of two straight lines is -1 if the two lines are orthogonal to each other. Therefore, the slope of the straight line orthogonal to the straight line SC is Wq / Wp. Moreover, because this straight line passes through the origin, its linear equation is ConVq = (Wq / Wp) × ConVp. □

Lemma 2. Let SC1 = Wp × ConVp + Wq × ConVq and SC2 = Wp × ConVp + Wq × ConVq represent two parallel straight lines (with different score values SC1 and SC2) in the coordinate plane, and let them intersect the straight line ConVq = (Wq / Wp) × ConVp in point A and point B respectively. If the length of line segment OA is longer than the length of line segment OB, then the value of SC1 is larger than the value of SC2.

Proof. First, the coordinates of A are ((SC1 × Wp) / (Wp² + Wq²), (SC1 × Wq) / (Wp² + Wq²)) and those of B are ((SC2 × Wp) / (Wp² + Wq²), (SC2 × Wq) / (Wp² + Wq²)). Then the lengths of the two line segments are

$$\overline{OA} = \frac{SC_1}{\sqrt{W_p^2 + W_q^2}}, \qquad \overline{OB} = \frac{SC_2}{\sqrt{W_p^2 + W_q^2}}.$$

Clearly, since the length of line segment OA is longer than the length of line segment OB, we can easily infer that the value of SC1 is larger than that of SC2. □

We use Figure 2 and Figure 3 to show the major concept of the above two lemmas. In both figures, the x-axis and y-axis represent the contributed values of the query terms p and q, respectively. The point (ConVpi, ConVqi) in the coordinate plane denotes the web page Hi returned by a search engine. Every dashed straight line intersecting the point (ConVpi, ConVqi) denotes the score of the web page Hi. In particular, Figure 2 represents the distribution of original scores of returned web pages, while Figure 3 represents the distribution of target scores. From the comparison of the two figures, we can observe that the dashed line which denotes the score of web page Hi alters when the weights of the two query terms vary. From Lemma 1 we know the reason is that the original score considers the weights of both query terms to be equivalent (that is, Wq / Wp = 1), but the target score does not (that is, Wq / Wp ≠ 1). From Lemma 2 we obtain that the further the distance between the dashed straight line and the origin is, the higher the score denoted by the dashed straight line will be.

Figure 2. The distribution of the original scores in the coordinate plane.

Figure 3. The distribution of the target scores in the coordinate plane.

Theorem 1. As shown in Figure 4, let the original score KSi = 0.5 ConVp + 0.5 ConVq denote a straight line in the coordinate plane, where 0 <= ConVp, ConVq <= 1, 1 <= i <= HN (HN is the number of all returned web pages), and 0 < Wq < Wp < 1. Then we can obtain an infinite number of straight lines KTi = Wp × ConVp + Wq × ConVq which intersect KSi and are orthogonal to the straight line ConVq = (Wq / Wp) × ConVp. (Note that these straight lines KTi denote the probable target scores of the original score KSi.) Therefore, for each original score KSi we can obtain a pair consisting of a maximum target score KTmaxi and a minimum target score KTmini which satisfy KTmini <= KTi <= KTmaxi.

Proof. Solve simultaneously the set of Equations (1) and (2):

$$KS_i = 0.5\,ConV_p + 0.5\,ConV_q, \qquad (1)$$
$$KT_i = W_p\,ConV_p + W_q\,ConV_q. \qquad (2)$$

By (1) × 2Wp - (2) we obtain ConVq × (Wp - Wq) = 2Wp × KSi - KTi, i.e.

$$ConV_q = \frac{2W_p\,KS_i - KT_i}{W_p - W_q}. \qquad (3)$$

By (1) × 2Wq - (2) we obtain ConVp × (Wq - Wp) = 2Wq × KSi - KTi, i.e.

$$ConV_p = \frac{2W_q\,KS_i - KT_i}{W_q - W_p}. \qquad (4)$$

From the definition of Theorem 1 we have 0 <= ConVp, ConVq <= 1; substituting these bounds into Equations (3) and (4) yields, for each original score KSi, the maximum target score KTmaxi and the minimum target score KTmini. □

Figure 4. The relation between the original score and the target score.

Theorem 2. If KSi > KSj, then KTmini > KTminj and KTmaxi > KTmaxj.

Proof. From Equations (7) and (8) in Theorem 1 we get that, if the values of the term weights Wp and Wq are fixed, the larger KSi is, the larger KTmaxi and KTmini will be. Therefore, if KSi > KSj, then KTmini > KTminj and KTmaxi > KTmaxj. □
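One explicit form of these bounds, obtained from Equations (1)-(4) together with the constraint 0 <= ConVp, ConVq <= 1, and consistent with Equations (11)-(14) used in Section 6, is the following reconstruction:

$$
KT_{max_i} =
\begin{cases}
W_p + W_q\,(2KS_i - 1), & KS_i \ge 0.5,\\[2pt]
2W_p\,KS_i, & KS_i < 0.5,
\end{cases}
\qquad
KT_{min_i} =
\begin{cases}
W_q + W_p\,(2KS_i - 1), & KS_i \ge 0.5,\\[2pt]
2W_q\,KS_i, & KS_i < 0.5.
\end{cases}
$$

Both bounds grow with KSi for fixed Wp and Wq, which is exactly the monotonicity used in the proof of Theorem 2.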
Since the returned web pages are ranked by their original scores (KS1 > KS2 > ... > KSHN), Theorem 2 gives KTmax1 > KTmax2 > ... > KTmaxi > ... > KTmaxHN and KTmin1 > KTmin2 > ... > KTmini > ... > KTminHN.

Theorem 3. Suppose there is an original score KSj whose maximum target score KTmaxj is larger than or equal to KTminK; then its corresponding target score KTj may be one of the top K of all target scores KTi.

Proof. Suppose there is an original score KSj (j > K) whose maximum target score KTmaxj is larger than or equal to KTminK, but KTj is assumed not to be one of the top K of all target scores KTi. However, as asserted above, because KTmin1 > KTmin2 > ... > KTminK-1 > KTminK, the top K of all target scores KTi will be KT1, KT2, KT3, ..., KTK in turn when KTi = KTmini, where 1 <= i <= K. Nevertheless, since KTmaxj is larger than or equal to KTminK, KTj may be larger than KTK and become the Kth rank of all KTi when KTj is equal to KTmaxj. As a result, our previous assumption is not true. Consequently, for each original score KSj, if its maximum target score KTmaxj is larger than or equal to KTminK, then its corresponding target score KTj may be one of the top K of all target scores KTi. □

As stated in Section 1, since the search engines do not return the individual contributed values ConVpi and ConVqi for each returned web page Hi, the coordinates of web page Hi, (ConVpi, ConVqi), may be located anywhere on the straight line KSi = 0.5 ConVpi + 0.5 ConVqi. As a result, we cannot obtain the real target score KTi for Hi. In Theorem 1 we have proved that the target score for each original score KSi will lie between the minimum target score KTmini and the maximum target score KTmaxi. That is, for given term weights Wp and Wq assigned by a user, we can derive an interval of possible target scores from each original score. Furthermore, in Theorem 2 we have proved that the distribution of KTmaxi and KTmini decreases gradually with decreasing original score. Figure 5 explicitly shows such a case. Finally, according to Theorem 3, in order to obtain the real top K interesting target ranks from the returned web pages, every web page Hi whose KTmaxi >= KTminK should be a candidate. Therefore, instead of viewing all HN returned web pages, the user only has to view those web pages whose KTmaxi >= KTminK to get their real top K target ranks.

Figure 5. The distribution of minimum target score and maximum target score.

We have presented the basic idea of our ESA method above by considering the case of two query terms. Here, we further extend it to the case of n query terms. Without loss of generality, we let W1 >= W2 >= ... >= Wn-1 >= Wn and 0 <= ConVi <= 1, where Wi denotes the weight of the user's ith query term and ConVi denotes the contributed value of the ith query term. Suppose a user submits a query with n query terms; the original score for each returned web page Hi is

$$KS_i = \tfrac{1}{n}\,ConV_1 + \tfrac{1}{n}\,ConV_2 + \dots + \tfrac{1}{n}\,ConV_n, \qquad (9)$$

and each returned web page's target score is

$$KT_i = W_1\,ConV_1 + W_2\,ConV_2 + \dots + W_n\,ConV_n. \qquad (10)$$

By computing (10) - Wj × n × (9) for each j, where 1 <= j <= n, we obtain n equations as follows:

$$KT_i = W_1\,n\,KS_i - (W_1 - W_2)\,ConV_2 - (W_1 - W_3)\,ConV_3 - \dots - (W_1 - W_n)\,ConV_n,$$
$$KT_i = W_2\,n\,KS_i - (W_2 - W_1)\,ConV_1 - (W_2 - W_3)\,ConV_3 - \dots - (W_2 - W_n)\,ConV_n,$$
$$\vdots$$
$$KT_i = W_n\,n\,KS_i - (W_n - W_1)\,ConV_1 - (W_n - W_2)\,ConV_2 - \dots - (W_n - W_{n-1})\,ConV_{n-1}.$$

Since W1 >= W2 >= ... >= Wn-1 >= Wn and 0 <= ConVi <= 1, from each of the equations above we can obtain an interval of possible values of the target score KTi. Therefore, for each web page Hi we have n intervals of possible values of its target score KTi, the jth of which is

$$W_j\,n\,KS_i - \sum_{m=j+1}^{n}(W_j - W_m) \;\le\; KT_i \;\le\; W_j\,n\,KS_i + \sum_{m=1}^{j-1}(W_m - W_j).$$

Intersecting these n intervals gives the tightest bounds KTmini and KTmaxi, which is exactly what Step 1 of the following procedure computes for n (n >= 2) query terms in our ESA method.
KTi=W2xnx KSi - {W2 - r,) X ConV, - {W2 - W3) y X ConVj -...-{fV2-fV„) x ConV„. KTi = W„xnx KSi-{W„-W,) x ConV,-{W„-W2) xConV2-...-{W„-W„.,)xConV„.j. Since Wj> W2> ...>W„.,> W„ and 0 < ConV^ < 1, from each equation above, we can obtain an interval of possible values of the target score KF/. Therefore, for each web page N„ we have « intervals of possible values of its target score /CT, as follows. /^PV, X n X KSi - {W, - W2) X ConV2 - (W, - W3) x ConVs-...-(IV, - X Co«K„ < KT/ 2) query terms in our ESA method. Figure 5 shows the major steps in ESA method; and it also shows how to evaluate the required top R web pages (R > K), which have to be viewed by the user to get his/her real top K target ranks. ESA method requires five kinds of input data, including Wj, K, n, HN, and KS,, where Wj denotes the weight of each query term j assigned by the user, K is the number of the user's interesting target ranks, n denotes the number of a user's query terms, H^)s the number of related web pages returned by a search engine, and KSi is the original score of each returned page Hi. In Step 1, ESA method outputs each web page's maximum target score KT„,a^i and minimum target score KT,„i„i, and then these bounds are used in Step 2 to derive (H©Value of R. Without loss of generality, we let iV,> W2> ... > W„., > W„. Besides, to simply the expressions, we define two functions in Figure 6, which are S,lj,n) = r ° ifni+1 /M=y+1 0 H j-^ if7<2 if/ >2 m=l , and Figure 5. Extreme Score Analysis method with ESA method will for every returned web Stepl: /* In this step, the agent compute the KT,„axi and KT„„ pages.*/ Input data Wj, K, n, HN, KS, Output data KT„,i„^, KT^a^i For i = 1 to HN For J = 1 to « MinKT J =PFyX„x KS,- (n -J) iVj + Sj(J. ri); MaxKTj =wjxnx KS- (J-\)Wj + S^C/.y); End for KT„;„; = MaxiMinKT,, MinKT^, MinKT,); KT„,^i = MiniMaxKTi, MaxKTj, MaxKT,,)-, End for Step2: /* The top K interesting web pages will exist in the front of the top R original ranks of all HN returned web pages. */ Input data HN, KT,„ì„k, KT„,a,i Output data R For i = Ito HN if KT„„,i< KT„„„f: Then R = i- 1; Exit; End if End for Take Table 1 as the example, we consider the case with n = 2 and suppose the search engine returned 20 (= HN) related web pages. We also suppose a user wants to respectively assign 0.3 and 0.7 to each query term's weight, and the user requests the top 3 interesting web pages. That is, n = 2,HN= 20, W, = 0.3, W2 = 0.7, and K = 3. To understand and verify the result ESA method, we used a-generator to generate and ConVp randomly for these 20 web pages and then compute each web page's original score KS;. The actual value of each target score KTi also can be obtained as shown in Table 1. (Note that in a real situation, ConV^, ConVp, and KT, are unavailable.) In Step 1, the KT„,axi and KT„,i„i for every returned web page are derived. Then, Step 2 derives the total number of web pages (i.e., the value of R) that the user has to view. In other words, Step 2 tries to find the web pages whose KT„,axi > = (= 0.854). In Table 1, we can find that KT,„ax7 (= 0.866) > KT„,i„f:(k = 3), but KT,„ax8 does not. As a result, the user just needs to view the top = 7 returned web pages (i.e., H,, H2, H3, H4, H5, //fi, Hj) to get the real top 3 interesting web pages with the new given term weights. Additionally, a scanning mechanism can be also offered to avoid users to sieve the real interesting target ranks from a lot candidates when the value of K is large. 
Take Table 1 as an example: we consider the case with n = 2 and suppose the search engine returned 20 (= HN) related web pages. We also suppose a user wants to assign 0.3 and 0.7 to the two query terms' weights, respectively, and requests the top 3 interesting web pages. That is, n = 2, HN = 20, W1 = 0.3, W2 = 0.7, and K = 3. To understand and verify the result of the ESA method, we used a generator to generate ConVp and ConVq randomly for these 20 web pages and then computed each web page's original score KSi. The actual value of each target score KTi can also be obtained, as shown in Table 1. (Note that in a real situation ConVp, ConVq, and KTi are unavailable.) In Step 1, the KTmaxi and KTmini for every returned web page are derived. Then Step 2 derives the total number of web pages (i.e. the value of R) that the user has to view. In other words, Step 2 tries to find the web pages whose KTmaxi >= KTminK (= 0.854). In Table 1 we can find that KTmax7 (= 0.866) >= KTminK (K = 3), but KTmax8 does not satisfy this. As a result, the user just needs to view the top R = 7 returned web pages (i.e. H1, H2, H3, H4, H5, H6, H7) to get the real top 3 interesting web pages with the newly given term weights. Additionally, a scanning mechanism can also be offered so that users do not have to sieve the real interesting target ranks from many candidates when the value of K is large. Afterwards, the user can directly get the real top 3 interesting target ranks (i.e. H1, H3, H5). More noticeably, with the ESA method, instead of all HN (= 20) returned web pages, only the top 7 web pages are scanned.

Table 1. An example of the ESA method (n = 2, HN = 20, W1 = 0.3, W2 = 0.7 and K = 3)

Page  Ogr  KSi    KTmini  KTmaxi  ConVp  ConVq  KTi
H1    1    0.934  0.908   0.960   0.920  0.948  0.928
H2    2    0.906  0.868   0.944   0.870  0.942  0.892
H3    3    0.896  0.854   0.938   0.980  0.812  0.930
H4    4    0.860  0.804   0.916   0.780  0.940  0.828
H5    5    0.857  0.800   0.914   0.950  0.764  0.894
H6    6    0.811  0.735   0.887   0.770  0.852  0.795
H7    7    0.776  0.686   0.866   0.760  0.792  0.770
H8    8    0.726  0.616   0.836   0.990  0.462  0.832
H9    9    0.648  0.507   0.789   0.713  0.583  0.674
H10   10   0.622  0.471   0.773   0.406  0.838  0.536
H11   11   0.579  0.411   0.747   0.666  0.492  0.614
H12   12   0.542  0.359   0.725   0.997  0.087  0.724
H13   13   0.532  0.345   0.719   0.746  0.318  0.618
H14   14   0.519  0.327   0.711   0.538  0.500  0.527
H15   15   0.429  0.257   0.601   0.200  0.658  0.337
H16   16   0.349  0.209   0.489   0.090  0.608  0.245
H17   17   0.279  0.167   0.391   0.340  0.218  0.303
H18   18   0.114  0.068   0.160   0.220  0.008  0.156
H19   19   0.060  0.036   0.084   0.091  0.029  0.072
H20   20   0.056  0.034   0.078   0.010  0.102  0.038

*Ogr denotes the original rank of a web page.

3.2 A system prototype with ESA method

Figure 6 illustrates the data flow among the user, our system prototype, and a web search engine. There are three main agents in our system prototype, which are discussed as follows.

Figure 6. The data flow among a user, our system prototype, and a web search engine.

a.) Retrieving agent
The retrieving agent accepts a user's query and dispatches this query to a web search engine. The original scores of all returned web pages are retrieved and then conveyed to the second component, the computing agent. It also retrieves some web pages' contents and dispatches them to the scanning agent if such a task is necessary.

b.) Computing agent
This agent adopts the ESA method and is responsible for deriving the candidate set of a user's top K target ranks. Its input data contain the original scores KSi and the number of returned web pages HN from the retrieving agent, and the number of query terms n, the weights Wj of each query term, and the value of K from the user. When a scanning task is required, the computing agent sends the top R target-rank candidates to the retrieving agent to ask for the contents of these web pages; otherwise it returns this un-ranked result to the user.

c.) Scanning agent
The scanning agent is composed of two pre-determined functions, the term weighting function and the similarity function, and two mechanisms, the full-text scanning mechanism and the ranking mechanism. It is an optional agent and works only when the user wants to get his/her real top K target ranks. Initially, it accepts the contents of the top R original ranks from the retrieving agent and then computes the contributed value of each query term in each retrieved web page by the full-text scanning mechanism and the term weighting function. The similarity scores of these candidates are further computed by the similarity function. Finally, the ranking mechanism sorts these top R target ranks and the scanning agent returns the ranked top K target ranks to the user. The simple pseudo code of this agent is presented as follows.
Get the content of the top R original ranks from the retrieving agent
For i = 1 to R
    Scan Hi and compute the contributed value of each query term j;
    Score the web page Hi according to the newly computed contributed values and a new similarity function, which is predetermined by the system;
End for
Rank Hi according to the target scores and return the top K target ranks to the user;

4 Evaluation of the simulation model experiment

4.1 Experiment simulation

In this section, to evaluate the efficiency of the ESA method, we establish a simulation model for studying the effects of different distributions of original scores by applying a generator to randomly generate several sets of original scores. There are two performance measures in this simulation model. One is called Per and the other is called Mul. Per denotes the percentage of web pages that a user needs to view to get his real top K target ranks (i.e. Per = 100 × (R / HN) %). Mul denotes the ratio of the number of returned web pages the user has to view to the number of the user's top K interesting target ranks (i.e. Mul = R / K). Each Per, Mul, and R is obtained by averaging the results of 100 simulation runs.

4.2 Results

The results of the experiment simulation are presented in Table 2. The average of all Mul is 4.41 (that is, R is about 4.41K), which means that on average users only have to view about 4.41 times the number K to get their real top K interesting web pages. Even in the worst case Mul = 10.8 (where Wp = 0.9, Wq = 0.1, HN = 5000, and K = 10), the number of web pages users have to view is about 10.8 times the number of their top K (= 10) interesting web pages. In such a condition users can request the system to perform the re-calculation task. In this case, the total number of returned web pages is 5000, and our system only has to re-calculate 108 (= 5000 × 2.16 %) returned web pages to provide users with the real top K (= 10) target ranks, which is much more efficient than re-calculating all 5000 returned web pages.

When users allow the system to re-calculate the real top K target ranks, the small Per in most cases shows the efficiency of our method again. The average of all Per is 8.97 %, which means that on average our system only needs to re-calculate about 8.97 % of all returned web pages to provide users with their interesting top K target ranks. Per becomes much larger only when K is close to HN. However, a search engine usually returns many more web pages than a user's interesting top K web pages; that is, HN is usually much larger than K. Moreover, we have some observations from the simulation results:

• For the same set of query term weights and the same K, the larger HN is, the smaller Per is.
• For the same HN and K, the smaller the difference between the query terms' weights is, the smaller Per is.
• For the same set of query term weights and the same HN, the smaller K is, the smaller Per is.

5 Extreme Score Inverse Analysis method

As shown in the previous example in Table 1, the ESA method has to first compute the maximum target score KTmaxi and the minimum target score KTmini for each returned web page Hi. When a search engine returns a large number of web pages, such a task is very time-consuming. In order to solve this problem, we propose an improved method, shown in Figure 7, called the Extreme Score Inverse Analysis method (ESIA method).
The ESIA method only has to derive an original score threshold KS_0, while the ESA method must calculate KT_min,i and KT_max,i for each returned web page H_i. This is the major advantage of Step 1 of the ESIA method over the ESA method.

Table 2. The result of experiment simulation

W                     HN     K    R     Per      Mul
Wp = 0.6, Wq = 0.4    5000   10   17    0.33%    1.7
                             20   33    0.66%    1.65
                             30   47    0.924%   1.57
                      1000   10   17    1.62%    1.7
                             20   32    3.10%    1.6
                             30   46    4.57%    1.53
                      500    10   16    3.01%    1.6
                             20   31    6.07%    1.55
                             30   46    9.01%    1.53
                             30   45    44.3%    1.5
Wp = 0.7, Wq = 0.3    5000   10   26    0.51%    2.6
                             20   50    1.00%    2.5
                             30   74    1.47%    2.47
                      1000   10   25    2.45%    2.5
                             20   47    4.68%    2.35
                             30   72    7.19%    2.4
                      500    10   25    4.85%    2.5
                             20   48    9.49%    2.4
                             30   71    14.14%   2.37
Wp = 0.8, Wq = 0.2    5000   10   48    0.95%    4.8
                             20   93    1.86%    4.65
                             30   122   2.43%    4.01
                      1000   10   41    4.08%    4.1
                             20   81    8.01%    4.05
                             30   125   12.50%   4.17
                      500    10   42    8.37%    4.2
                             20   78    15.54%   3.9
                             30   122   24.37%   4.01
Wp = 0.9, Wq = 0.1    5000   10   108   2.16%    10.8
                             20   201   4.01%    10.1
                             30   280   5.59%    9.33
                      1000   10   93    9.21%    9.3
                             20   179   17.88%   8.95
                             30   279   27.84%   9.3
                      500    10   96    19.02%   9.6
                             20   177   35.22%   8.85
                             30   245   48.91%   8.17
Average                                 8.97%    4.41

Step 1: /* In this step, the agent with the ESIA method first derives KT_min,K and then computes the original score threshold KS_0. */
Input data: Wj, K, n, HN, KS_K
Output data: KS_0
For j = 1 to n
  MinKT_j = n × Wj × KS_K − (n − j) × Wj + S_i(j, n);
End For
KT_min,K = Max{MinKT_1, MinKT_2, ..., MinKT_n};
For j = 1 to n
  KS_j = (KT_min,K + (j − 1) × Wj − S_i(j, n)) / (n × Wj);
End For
KS_0 = Max{KS_1, KS_2, ..., KS_n};

Step 2: /* The top K interesting web pages will exist in the front of the top R original ranks of all HN returned web pages. */
Input data: HN, KS_i
Output data: R
For i = 1 to HN
  If KS_i < KS_0 then
    R = i − 1;
    Exit;
  End If
End For

Figure 7. Extreme Score Inverse Analysis method.

Since we are not sure which MinKT_j is the maximum, we expand the above equation and obtain n values of the original score KS_j = (KT_min,K + (j − 1) × Wj − S_i(j, n)) / (n × Wj), where S_i(j, n) is defined in the ESA method, n is the number of the user's query terms, and 1 ≤ j ≤ n; the threshold is then KS_0 = Max{KS_1, ..., KS_n}. Applying the two steps to the example in Table 1 (only the pages with KS_i ≥ KS_0 are kept), we can observe that the result is identical to the one obtained by the ESA method. Most notably, without calculating KT_min,i and KT_max,i for each returned web page H_i, the ESIA method can be much more efficient than the ESA method.

6 Evaluation of the mathematical analysis model

6.1 Mathematical analysis

Unlike the evaluation in Section 4, since the original scores do not have a fixed distribution format, to proceed with a mathematical analysis in this section we assume that the original scores returned by the search engine are distributed uniformly in the interval (0, 1). (Note that a web page with original score = 0 will not be returned by the search engine.) As in Section 4, in this mathematical model we still use the two performance measures Per and Mul stated earlier. To distinguish the evaluation in this section from that in Section 4, we use the superscript * in Per*, Mul*, and R*.

Without loss of generality, we let Wp ≥ Wq. From Equations (7) and (8) in Theorem 1 we can obtain that:

If KS_i > 0.5 then
  KT_max,i = Wp + Wq × (2KS_i − 1),   (11)
  KT_min,i = Wq + Wp × (2KS_i − 1).   (12)
If KS_i ≤ 0.5 then
  KT_max,i = Wp + Wp × (2KS_i − 1),   (13)
  KT_min,i = Wq + Wq × (2KS_i − 1).   (14)

Furthermore, from Theorem 3 and Equations (11), (12), (13), and (14), we can get that:

If KS_0 ≥ 0.5 and KS_K ≥ 0.5 then
  KT_max,0 = Wp + Wq × (2KS_0 − 1),  KT_min,K = Wq + Wp × (2KS_K − 1),
so we can get
  KS_0 = (Wp × KS_K − Wp + Wq) / Wq.   (15)

If KS_0 < 0.5 and KS_K ≥ 0.5 then
  KT_max,0 = Wp + Wp × (2KS_0 − 1),  KT_min,K = Wq + Wp × (2KS_K − 1),
so we can get
  KS_0 = (2Wp × KS_K − Wp + Wq) / (2Wp).   (16)

If KS_0 ≥ 0.5 and KS_K < 0.5 then
  KT_max,0 = Wp + Wq × (2KS_0 − 1),  KT_min,K = Wq + Wq × (2KS_K − 1),
so we can get
  KS_0 = (2Wq × KS_K − Wp + Wq) / (2Wq).   (17)
If KS_0 < 0.5 and KS_K < 0.5 then
  KT_max,0 = Wp + Wp × (2KS_0 − 1),  KT_min,K = Wq + Wq × (2KS_K − 1),
so we can get
  KS_0 = Wq × KS_K / Wp.   (18)

On the assumption that the original scores returned by the search engine are distributed uniformly, the set of original scores is an arithmetic sequence whose maximum is 1 and whose common difference is 1/HN. Then we can obtain KS_i = 1 − ((i − 1)/HN), and the term T of KS_0 in this arithmetic sequence is T = HN × (1 − KS_0) + 1. Finally, we can get

  Per* = 100 × (T / HN) % = 100 × (1 + (1/HN) − KS_0) %.   (19)

After substituting Equations (15), (16), (17), and (18) into Equation (19), we can get Per* as follows.

If KS_0 ≥ 0.5 and KS_K ≥ 0.5 then
  Per* = 100 × (1 + (1/HN) − (Wp × KS_K − Wp + Wq)/Wq) %.   (20)
If KS_0 < 0.5 and KS_K ≥ 0.5 then
  Per* = 100 × (1 + (1/HN) − (2Wp × KS_K − Wp + Wq)/(2Wp)) %.   (21)
If KS_0 ≥ 0.5 and KS_K < 0.5 then
  Per* = 100 × (1 + (1/HN) − (2Wq × KS_K − Wp + Wq)/(2Wq)) %.   (22)
If KS_0 < 0.5 and KS_K < 0.5 then
  Per* = 100 × (1 + (1/HN) − (Wq × KS_K / Wp)) %.   (23)

After computing Per*, we can get R* = Round(Per* × HN) and Mul* = R* / K.

6.2 Results

The results of the mathematical analysis are presented in Table 3. From a comparison of Tables 2 and 3, we can see that the results of the experimental simulation model and the mathematical analysis model are close to each other. As a result, according to the results of the mathematical analysis, if users allow some degree of error, our approach can inform them how many returned web pages they have to view without any calculation. Similarly, our proposed method can decide how many web pages it has to retrieve and re-scan to provide users with their real top K target ranks without any calculation.

Table 3. The result of mathematical analysis

W                     HN     K    R*    Per*     Mul*
Wp = 0.6, Wq = 0.4    5000   10   15    0.29%    1.5
                             20   30    0.59%    1.5
                             30   45    0.89%    1.5
                      1000   10   15    1.45%    1.5
                             20   30    2.95%    1.5
                             30   45    4.45%    1.5
                      500    10   15    2.90%    1.5
                             20   30    5.90%    1.5
                             30   45    8.90%    1.5
Wp = 0.7, Wq = 0.3    5000   10   22    0.44%    2.2
                             20   46    0.91%    2.3
                             30   69    1.37%    2.3
                      1000   10   22    2.20%    2.2
                             20   46    4.53%    2.3
                             30   69    6.87%    2.3
                      500    10   22    4.40%    2.2
                             20   46    9.07%    2.3
                             30   69    13.73%   2.3
Wp = 0.8, Wq = 0.2    5000   10   37    0.74%    3.7
                             20   77    1.54%    3.85
                             30   117   2.34%    3.9
                      1000   10   37    3.70%    3.7
                             20   77    7.70%    3.85
                             30   117   11.70%   3.9
                      500    10   37    7.40%    3.7
                             20   77    15.40%   3.85
                             30   117   23.40%   3.9
Wp = 0.9, Wq = 0.1    5000   10   82    1.64%    8.2
                             20   172   3.44%    8.6
                             30   262   5.24%    8.73
                      1000   10   82    8.20%    8.2
                             20   172   17.20%   8.6
                             30   262   26.20%   8.73
                      500    10   82    16.40%   8.2
                             20   172   34.40%   8.6
                             30   253   50.44%   8.43
Average                                 8.58%    4.02
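To make the closed-form evaluation concrete, the following is a small Python sketch of Equations (15)-(23). It is an illustration, not the authors' code: the branch of the case analysis is picked by checking which assumption on KS_0 turns out to be self-consistent, and R* is computed with Per* treated as a fraction. For Wp = 0.7, Wq = 0.3, HN = 500 and K = 10 it reproduces the corresponding Table 3 row (about 4.40%, R* = 22, Mul* = 2.2).

```python
# Sketch: closed-form Per*, R*, Mul* from Equations (15)-(23), assuming uniformly
# distributed original scores. The case split on KS_0 is resolved by trying a branch
# and falling back when its assumption about KS_0 is not self-consistent.

def per_star(wp: float, wq: float, hn: int, k: int) -> tuple[float, int, float]:
    ks_k = 1 - (k - 1) / hn                              # KS_K from the arithmetic sequence
    if ks_k >= 0.5:
        ks0 = (wp * ks_k - wp + wq) / wq                 # Eq. (15), assumes KS_0 >= 0.5
        if ks0 < 0.5:
            ks0 = (2 * wp * ks_k - wp + wq) / (2 * wp)   # Eq. (16)
    else:
        ks0 = (2 * wq * ks_k - wp + wq) / (2 * wq)       # Eq. (17), assumes KS_0 >= 0.5
        if ks0 < 0.5:
            ks0 = wq * ks_k / wp                         # Eq. (18)
    per = 100 * (1 + 1 / hn - ks0)                       # Eq. (19)
    r = round(per / 100 * hn)                            # R* with Per* taken as a fraction
    return per, r, r / k                                 # Per*, R*, Mul*

print(per_star(0.7, 0.3, 500, 10))   # roughly (4.40, 22, 2.2), cf. Table 3
```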
7 Conclusion

The existing search engines all put the same emphasis on submitted query terms combined by a Boolean expression. However, users may put different emphasis on each query term; that is, users could be allowed to assign different weights to each query term. For such a case, a system must re-calculate the score of each web page according to the newly given weights to provide users with their real ranking results. Besides, in reality, most users usually view only the top K web pages out of the HN returned web pages. Nowadays, a typical search engine does not return sufficient information for the system to immediately recalculate the new score; the system must rescan all returned web pages to provide users with their top K interesting web pages, and such a re-calculation task is very time-consuming. Based on this top K issue, we have proposed the efficient ESA method, which adopts the original score to derive the candidates of users' top K target ranks, to solve this problem. We furthermore improve the ESA method and propose the ESIA method in Section 5. Through the use of the proposed method, our system does not have to rescan any returned web pages but can inform users that their top K interesting web pages will be among the top R web pages (K ≤ R ≤ HN).

Figure 4: Autonomous actions contained within an Internet Banking transaction

Action 1: A customer uses the Internet to connect to their bank's Web site;
Action 2: The customer browses the Web site and decides on a service. An Internet Banking transaction is initiated by the customer by providing both invoice and payment information;
Action 3: The bank checks whether the transaction is executable by verifying that the customer has enough funds available, and a reply is returned to the customer;
Action 4: Upon completion of the transaction, confirmation is sent to the customer;
Action 5: The bank honours the payment and returns proof of having done so.

These transactions could be broken down further if deemed necessary. A decision table can then be used to assist in the identification of the essential security requirements for the Internet Banking environment as previously described.

7 Security decision analysis

The decision table shown in Table 1 illustrates an example based on the scenario outlined previously. A brief discussion of the steps to develop such a decision table is also provided.

Table 1: Decision Table (the spheres Customer, Internet, and Bank and the security requirements 1-8, mapped against actions A1-A5)

Following is a brief description of each of the steps.
Step 1: consists of listing all the security requirements that must be satisfied, as discussed previously;
Step 2: consists of listing the spheres that have been identified. Only the three spheres shown in figure 3 are used, i.e. bank, Internet and customer, whether home or business based;
Step 3: comprises listing the actions that make up a transaction. The five actions previously identified are used in the decision table;
Step 4: maps the actions onto the spheres identified in step 2. Naturally not all the actions will include all the spheres;
Step 5: associates the security requirements with an individual action. This information is then applied to establish how it can be implemented within the relevant sphere.

In the decision table, action 1 shows that the bank must be able to identify and authenticate a customer satisfactorily to perform a transaction. At the same time the customer wants privacy regarding the personal account information being viewed; privacy in this context refers to this information being unavailable to other parties. Actions 4 and 5 require the bank to send confirmation of the transaction and to ensure confidentiality and integrity of this message. Concurrently, the customer wants a guarantee that the bank cannot later deny that the transaction took place, which refers to the security requirement of non-repudiation. The table also indicates that the bank needs to record the transaction correctly in order to meet auditing requirements. By looking at the security requirements for each action, it is possible to identify the security mechanisms required to secure the Internet Banking environment. For action 1, the identification and authentication security requirement could very well be facilitated by the implementation of a smart card authentication system, possibly with an accompanying biometric mechanism.
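One way to picture the decision-table idea is as a small lookup structure. The sketch below is purely illustrative: it encodes only the requirements spelled out in the worked example (actions A1, A4 and A5), with hypothetical names, and it is not a reproduction of the paper's full Table 1.

```python
# Illustrative sketch of a decision table as a nested mapping: for each action,
# which spheres are involved and which security requirements apply. Only the
# requirements mentioned in the worked example are encoded; the rest of the
# printed matrix is not reproduced, so treat the entries as placeholders.

DECISION_TABLE = {
    "A1": {"spheres": ["Customer", "Internet", "Bank"],
           "requirements": ["identification & authentication", "privacy"]},
    "A4": {"spheres": ["Customer", "Internet", "Bank"],
           "requirements": ["confidentiality", "integrity", "non-repudiation", "auditing"]},
    "A5": {"spheres": ["Customer", "Internet", "Bank"],
           "requirements": ["confidentiality", "integrity", "non-repudiation", "auditing"]},
}

def requirements_for(action: str) -> list[str]:
    """Look up the security requirements associated with one autonomous action."""
    return DECISION_TABLE.get(action, {}).get("requirements", [])

print(requirements_for("A1"))   # ['identification & authentication', 'privacy']
```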
The required infrastructure, through developed standards and technological know-how, has already been established for smart cards, providing certain support for this initiative. The security requirements for action 2 might suggest that SSL be used for securing the communication session (currently used with 128-bit encryption) across the Internet. This may be only an interim approach, or one for the long term, depending on the implementation and widespread adoption of version 6 of the Internet Protocol (IPv6). Nevertheless it would be imperative to conduct timely checks on the protection provided by 128-bit encryption, given the high likelihood that it will be broken in the near future. The security requirements for actions 4 and 5 may be satisfied using SSL, although the acknowledgement needs to be digitally signed by the bank to conform to the security requirement of non-repudiation for all transactions. This would be catered for by the use of digital certificates.

8 Validation

For the purpose of this demonstration, the following provides an evaluation of one of the identified scenarios based on the framework developed previously [7]. The first evaluation is based on the consumer-to-business E-commerce environment depicted in figure 5.

Figure 5: Consumer Scenario (Cell Phones, PDAs)

In this scenario the areas that must be secured include:
• The consumer;
• The terminal (cell phone or PDA);
• The wireless and public network (telecommunication exchange);
• The Internet (communication server); and
• The Bank.

This is represented in figure 6 below, which also illustrates the autonomous actions contained within this particular environment.

Figure 6: Autonomous actions contained within the cell phone, PDA scenario

Following is an explanation of these actions.
Action 1 - The consumer uses a mobile phone or personal digital assistant (PDA) to connect to a wireless or other public network.
Action 2 - Through the telecommunication exchange and the Internet, the consumer is able to connect to their bank's Web site.
Action 3 - The consumer browses the Web site shown on their cell phone or PDA screen and decides on a service, e.g. transfer funds from account A to account B.
Action 4 - The bank checks the validity of the consumer's request.
Action 5 - The bank sends the confirmation to the consumer upon completion of the transaction.
Action 6 - The bank honours the transfer and returns verification to the consumer.

From this the decision table can be derived, as shown in Table 2.

Table 2: Scenario Decision Table, derived from Table 1 (the spheres Customer, PDA, Wireless network, Internet, and Bank and the security requirements 1-8, mapped against actions A1-A6)

By viewing the security requirements for each action, the security mechanisms required to secure this Internet Banking environment can be suitably determined. For example, confidentiality can be assured by a smart card acting as a veritable lock between the secret code on the chip and the unsecured terminal environment (in this case the cell phone, PDA, and telecommunication exchange). In addition, authentication can be provided for via the use of a PIN as well as an integrated digital signature and digital certificate associated with a smart card system.
Further, data integrity can be catered for via the use of Message Authentication Codes that are built into the Secure Socket Layer (SSL), which can be used for securing the Web session over the Internet. To prevent Internet-based users from breaching the banking network, a firewall should be implemented to isolate the Web server from the customer information database. Finally, by complementing the identification and authentication process of Internet Banking based transactions with technologies like public-key cryptography, digital notary and digital signature, repudiation of transactions is protected against.

9 Conclusion

The entities involved in the transaction, including the technological components, are clearly defined and arranged accordingly. Naturally the various entities will require different security requirements based on their interaction within the specified Internet Banking environment. The model caters for this determination by providing a detailed decision table that amalgamates all the information gathered in the six-step process. This valuable cross-referencing method ensures that all avenues from whence contingencies arise are covered. The framework of authentication for Internet Banking allows customers to work their way through each step, identifying the necessary security requirements along with the counteracting authentication mechanism. The distinctive style of the framework, including explicit descriptions, examples and cross-referencing capability, ensures that all security requirements and authentication mechanisms are sufficiently identified for correct and effective implementation.

References

[1] BBC (2000) Safety fears for web banking, BBC News, Tuesday, 1 August 2000. http://news6.thdo.bbc.co.uk/hi/english/business/newsid 861000/861353.stm (accessed 8 August 2000).
[2] Creed, A (2000) Western Union Site Down After Theft Of Credit Card Details, Newsbytes, Englewood, Colorado, USA, September 10.
[3] Gutterman, S (2000) Western Union Web Site Is Hacked, Associated Press, September 10.
[4] Knight, W (2000) Barclays security breach forces online service to close, ZDNet UK, Monday 31 July 2000. http://www.zdnet.co.uk/news/2000/30/ns17002.html (accessed 1 August 2000).
[5] Knight, W (2000a) Barclays in security gaffe this week, ZDNet UK, Wednesday 02 August 2000. http://www.zdnet.co.uk/news/2000/30/ns17040.html (accessed 3 August 2000).
[6] Labuschagne, L (2000) A new approach to dynamic Internet risk analysis. Thesis (D.Com), Rand Afrikaans University, South Africa, 2000. http://csweb.rau.ac.za/deth/acad/thesis/ (accessed 16 August 2000).
[7] Hutchinson, D and Warren, M.J (2001) A Framework of Security Authentication for Internet Banking, Information Society 2001, October, Ljubljana, Slovenia.

Insights offered by data-mining when analyzing media space data

Maja Skrjanc, Marko Grobelnik, and Darko Zupanic
Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia
Maja.Skrjanc@ijs.si, marko.grobelnik@ijs.si, darko.zupanic@ijs.si

Keywords: data mining, media space data, data analysis

Received: June 30, 2001

The media space consists of many different factors fighting for the attention of the customer population in a certain environment. A common problem in bigger environments (or countries) is that a dataset describing the complete media space is hard or almost impossible to obtain, since the detailed picture is too complex or too expensive to compose. However, this is not the case in smaller environments, where it is therefore easier to collect the data.
We have access to data entirely describing the media space of a population of 2 million people. Because of the language and the economy, this media space functions relatively independently of outside factors, especially those outside the country. The data was collected by the Media Research Institute Mediana. The database consists of 8000 questionnaires gathered in 1998; the sample and the questionnaires were prepared according to comparable research standards. In this paper we will discuss different types of questions which might become great assistants in unfolding the media groups, audience fluctuation and a profound understanding of happenings in the media space, as well as in their prediction.

1 Introduction

New emerging technologies enable more transparent communication between the media and the audience. As the response time for information feedback is becoming shorter, the general public can be more involved in the process of shaping the media. For the same reason, measuring the media impact on the public is becoming much easier. Information about the media space, its dynamics, and the interactions between the public and the media is regularly monitored, collected and analyzed. These types of information and knowledge raise different kinds of questions, which were also addressed in several other analyses [9, 10]. Knowledge or new information extracted from the data is value-added information, which in this highly competitive environment represents a crucial factor. One of the possible approaches to getting additional value from the gathered information is the use of data mining techniques, which can not only contribute to deeper data analysis, but can also create new additional services providing new insights into the data.

Although Slovenia is a small media space and is also very specific because of the language, it is not an exception in comparison to some other media environments. Since 1992 the Media Research Institute Mediana (http://www.inn-mediana.si/) has followed all printed, TV and radio media in Slovenia. They are trying to unfold and explain Slovenia's media image by collecting all kinds of data about the media and analyzing them with simple statistical methods. As a part of the Sol-Eu-Net project (http://SolEuNet.ijs.si/) we came into the position to analyze the data with more sophisticated data mining methods and present our outcomes to Mediana.

In this article we will answer some selected questions from the large space of interesting questions arising from the need for a better understanding of the media space. In the first section we will describe the quality, structure and contents of the data set. The second section will define the selected questions. In the third section we will describe our experiments and the methods we used to answer the questions. The answers are supported by many comprehensible examples of rules and trees. At the end, in the fourth section, we will describe Mediana's response and show some directions for future work.

2 The Data

In this section we describe the contents, structure and quality of the analyzed data set. For the purpose of our analysis we took one of Mediana's data sets, describing the whole Slovenian media space. The data was gathered according to comparable international research standards. The data set consists of about 8000 questionnaires tracking all of the important reading, listening, and watching media in Slovenia.
Each questionnaire consists of about 1100 questions collected in groups: about the person's relation to certain media, the person's activities, interests, life style, income, property, and demographic data. The relation of a person to a certain medium is checked in more detail with different questions testing several aspects of using the medium. The first page of the poll contains general questions, followed by 19 pages of specific questions. Most of the questions are asked in such a way that the answer consists of a grading scale with 2, 5 or 9 levels. The data set is presented in a spreadsheet table, where each questionnaire represents a row and each question represents a column. In general, Mediana's dataset is of rather high quality, meaning that we didn't have much work cleaning and transforming the data.

3 Questions

Mediana didn't give us any specific questions; they just gave us the challenge to find something interesting in their data which might be of any interest to them. Therefore, we had to think of some questions which Mediana would find interesting in the sense that the answers or techniques would represent a possibility for offering an additional commercial service. The requirement for the answers and the resulting models was comprehensibility, especially in comparison to the classical statistical methods which they had already used. For our analysis, we selected the following questions:
=> Which printed media are also read by the readers of a selected newspaper/magazine?
=> What are the properties of readers/listeners/watchers of a specific medium?
=> Which of these properties distinguish between the readers of different newspapers?
=> What are typical groups of people according to their personal characteristics?
=> Which media are similar?

4 Experiments

4.1 Methods

To answer the selected questions we used the following methods:
o Correlation based clustering of the attributes
o Association rules (Apriori) [4]
o Decision trees (C4.5, C5) [5]
o Clustering (K-Means) [7]
o Kohonen Network [6]

Our goal was to find some relations within the data which are not obvious at first sight and are comprehensible to Mediana. Due to that, our results do not optimize accuracy, but comprehensibility. Usually, our interpretation of the results does not rely only on a particular rule, but is generalized over a group of similar rules. By that, we can offer better interpretation and generalization.

4.1.1 Clustering of the attributes

In this section we discuss the relations between the attributes. To determine the dependencies between the attributes we used clustering, where the distance was the correlation coefficient between the attributes. As the result we got clusters of attributes which are correlated with a correlation coefficient above a certain threshold (0.5). Some of the resulting groups collect different attributes describing several aspects of the same medium. For most of the groups we were able to provide a very comprehensive explanation.
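As a minimal sketch of this correlation-threshold grouping, assuming the questionnaire answers sit in a pandas DataFrame with one column per attribute (the column names below are synthetic placeholders), hierarchical clustering with 1 - |correlation| as the distance yields groups whose members are pairwise correlated above the 0.5 cut-off:

```python
# Sketch: group questionnaire attributes whose pairwise correlation exceeds 0.5,
# using hierarchical clustering with distance = 1 - |correlation|. The DataFrame
# "answers" (one row per questionnaire, one column per attribute) is assumed.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def correlated_attribute_groups(answers: pd.DataFrame, threshold: float = 0.5) -> dict[int, list[str]]:
    corr = answers.corr().to_numpy()
    dist = 1.0 - np.abs(corr)                 # correlated attributes are "close"
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    # complete linkage: every pair inside a cluster correlates above the threshold
    clusters = fcluster(linkage(condensed, method="complete"),
                        t=1.0 - threshold, criterion="distance")
    groups: dict[int, list[str]] = {}
    for name, cid in zip(answers.columns, clusters):
        groups.setdefault(int(cid), []).append(name)
    return {cid: cols for cid, cols in groups.items() if len(cols) > 1}

# Example with a few synthetic attributes standing in for real questionnaire columns.
rng = np.random.default_rng(0)
golf = rng.integers(1, 6, size=200)
demo = pd.DataFrame({"BMERead_Golf": golf,
                     "BMEIssues_Golf": golf + rng.integers(0, 2, size=200),
                     "spareTime_Cinema": rng.integers(1, 6, size=200)})
print(correlated_attribute_groups(demo))
```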
Some other groups consist of attributes having very obvious relationships, like the group in EXAMPLE 1. This group is composed of attributes dealing with the same medium. The questions corresponding to these particular attributes are: (1) Did you read the magazine Golf in the last year (BMERead_Golf)? (2) How many issues did you read in the last 12 months (BMEIssues_Golf)? (3) How long ago did you read your last issue (BMELastRead_Golf)?

EXAMPLE 1: Correlations between different aspects of the same medium.
Attributes:
BMERead_Golf
BMEIssues_Golf
BMELastRead_Golf

Another type of cluster represents attributes with high correlations between the region and all or most of the editions of the same newspaper company. In the particular case in EXAMPLE 2, the newspaper company Vecer is a local newspaper company: Vecer is the main daily newspaper, and Večerov Četrtek, Vecer Televizija in Radio, and Vecer v Soboto are its supplements. We can see that groupings similar to EXAMPLE 1 are joined with some demographic attributes like region, community and local community. They emphasize the local influence of these media.

EXAMPLE 2
Attributes:
REGION
COMMUNITY
LOCAL_COMMUNITY
QUESTIONER
DERead_Vecer
DESRead_Vecerov Četrtek
DESRead_Vecer, Televizija, Radio
DESRead_Vecer v Soboto
DEIssues_Vecer
DESIssues_Vecerov Četrtek
DESIssues_Vecer, Televizija, and Radio
DESIssues_Vecer v Soboto
DELastRead_Vecer
DESLastRead_Vecerov Četrtek
DESLastRead_Vecer Televizija, Radio
DESLastRead_Vecer v Soboto

Another type of cluster stresses the correlations between attributes describing the person's age and spare time activities, which are part of the life style group of questions. EXAMPLE 3 presents the group of attributes describing the person's age and type of spare time activities: How often do you go to the cinema (spareTime_Cinema)? Do you study in your spare time (spareTime_Study)? How often do you listen to CDs or LPs (spareTime_LP/CD)? Do you speak English (languagesUnderstanding_English)? When was the last time you were in the cinema (cinema)? This type of cluster also points out that part of the person's life style is age dependent.

EXAMPLE 3
Attributes:
spareTime_Cinema
spareTime_Study
cinema
basicDescriptions_BirthYear
delivered_Age
spareTime_LP/CD
languagesUnderstanding_English

An interesting but not unexpected type of cluster is represented by EXAMPLE 4. It presents the correlations between spare time activities and the possession of items. A person who is interested in science (mediaInterest_Science) and computer science (mediaInterest_ComputerScience) and uses a computer in his or her spare time (spareTime_Computers) very likely has a computer at home (householdItemsHas_PersonalComputer), as well as a modem, internet access, video and a CD-ROM (householdItemsHas_Modem, householdItemsHas_Internet at Home, householdItemsHas_Video/ComputerGames, householdItemsHas_CD ROM).

EXAMPLE 4
Attributes:
spareTime_Computers
mediaInterest_ComputerScience
householdItemsHas_PersonalComputer
mediaInterest_Science
householdItemsHas_CD ROM
householdItemsHas_Video/Computer Games
propertiesHas_PersonalComputer
householdItemsHas_Modem
householdItemsHas_Internet at Home

A similar type of cluster to the one from EXAMPLE 2 is EXAMPLE 5; it also includes correlations like those in EXAMPLE 1. Here we can observe that all media from a certain province are highly correlated. Dolenjski List, TV Novo Mesto, Radio Krka, Studio D, and Radio Sraka are media from the province of Dolenjska. People obviously like to keep track of local events, which are covered mostly by local media.

EXAMPLE 5
Attributes:
WERead_Dolenjski list
WEIssues_Dolenjski list
WELastRead_Dolenjski list
televisionsWatched_TV Novo mesto
televisionsLastWatch_TV Novo mesto
televisionsDays_TV Novo mesto
radiosLastListen_Radio Krka, Novo mesto
radiosLastListen_Studio D
radiosDays_Radio Krka, Novo mesto
radiosDays_Studio D
radiosListened_Radio Krka, Novo mesto
radiosListened_Radio Sraka - Novo mesto
radiosListened_Studio D

4.1.2 Association rules

Association rules [4] are part of the standard selection of data mining techniques. We used them (as implemented in [1] and [2]) to answer two questions: (1) how the readers of one daily newspaper are related to the readers of other newspapers, and (2) what is their relation to all other attributes. These questions were tested on the whole set of attributes, except the question about relations between different newspapers, which was tested only on the chosen newspaper attributes. Surprisingly, the answers of both tests were almost identical, which gave us the idea that reading a certain newspaper is very much dependent on reading some other newspapers. The exceptions were two local daily newspapers, which are very region dependent.

Let us look at some examples of association rules. EXAMPLE 6 and EXAMPLE 7 present comprehensible interpretations of the rules. The attributes in EXAMPLE 6 and EXAMPLE 7 correspond to questions about reading certain magazines and newspapers in the last year or in the last six months. The number at the end of the left side of a rule is the number of covered examples (support), and the number on the right side of the rule is the fraction of those examples for which the rule holds (confidence). As the result we got rules which uncover the relations between different publications. These relations were very interesting, particularly because of the nature of the topics these publications mainly cover. In EXAMPLE 6 we got rules which associate the readers of the biggest Slovenian daily newspaper, Delo, with readers of magazines (Marketing magazin, Finance, Razgledi, Denar, Vip) which cover mainly economics and marketing topics. In EXAMPLE 7, the rules show the connection between the readers of Slovenske Novice and publications which are known more as 'yellow press' (Sara, Ljubezenske zgodbe, Omama); they cover mostly romantic and erotic topics. This is not surprising, since Slovenske Novice is known as a kind of yellow press daily newspaper. It is the most read newspaper in Slovenia.

EXAMPLE 6. Interpretation: "The majority of the readers of any of the following publications: Marketing magazin, Finance, Razgledi, Denar, and Vip are also readers of Delo."
Rules: DERead_Delo
1. MERead_Marketing magazin (sup.=116) ==> DERead_Delo (conf.=0.82)
2. WERead_Finance (sup.=223) ==> DERead_Delo (conf.=0.81)
3. BWERead_Razgledi (sup.=201) ==> DERead_Delo (conf.=0.78)
4. BWERead_Denar (sup.=197) ==> DERead_Delo (conf.=0.76)
5. MERead_Vip (sup.=181) ==> DERead_Delo (conf.=0.74)

EXAMPLE 7. Interpretation: "The majority of the readers of any of the following publications: Sara, Ljubezenske zgodbe, Dolenjski list, Omama, and Delavska enotnost are also readers of Slovenske novice."
Rules: DERead_Slovenske novice
1. BWERead_Sara (sup.=332) ==> DERead_Slovenske novice (conf.=0.64)
2. WERead_Ljubezenske zgodbe (sup.=283) ==> DERead_Slovenske novice (conf.=0.61)
3. WERead_Dolenjski list (sup.=520) ==> DERead_Slovenske novice (conf.=0.6)
4. MERead_Omama (sup.=154) ==> DERead_Slovenske novice (conf.=0.58)
5. WERead_Delavska enotnost (sup.=177) ==> DERead_Slovenske novice (conf.=0.58)
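A minimal sketch of how such single-antecedent rules and their support and confidence can be computed is given below. It is plain Python rather than the Apriori implementations of [1] and [2], and the attribute names and thresholds are illustrative only.

```python
# Sketch: support/confidence of one-antecedent rules "reads A ==> reads B",
# computed directly from boolean reading indicators. As in the examples above,
# "support" is reported as the number of examples covered by the left side.
from itertools import permutations

def reading_rules(rows: list[dict[str, bool]], min_sup: int = 2, min_conf: float = 0.5):
    attrs = sorted({a for row in rows for a in row})
    rules = []
    for a, b in permutations(attrs, 2):
        sup_a = sum(1 for r in rows if r.get(a))               # readers of A
        sup_ab = sum(1 for r in rows if r.get(a) and r.get(b)) # readers of A and B
        if sup_a >= min_sup and sup_ab / sup_a >= min_conf:
            rules.append((a, b, sup_a, round(sup_ab / sup_a, 2)))
    return rules  # (antecedent, consequent, support, confidence)

rows = [
    {"MERead_Finance": True,  "DERead_Delo": True},
    {"MERead_Finance": True,  "DERead_Delo": True},
    {"MERead_Finance": True,  "DERead_Delo": False},
    {"MERead_Finance": False, "DERead_Delo": True},
]
for rule in reading_rules(rows):
    print(rule)   # e.g. ('MERead_Finance', 'DERead_Delo', 3, 0.67)
```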
4.1.3 Decision Trees

Decision trees [5] are also part of the standard repertoire of data mining and machine learning techniques. We used C4.5 (as implemented in [1]) and C5.0 (as implemented in [2]) to describe the characteristics of the readers of a certain daily newspaper. Another question that we tried to answer was how the readers of one daily newspaper differ from the readers of another daily newspaper. With decision trees we get the most natural and understandable interpretation for these problems. We try to identify the typical descriptive characteristics of the readers, including their life style, life statements, capability of recognizing trademarks, their interests, etc. After putting together the rules and their interpretations, we got a description of the typical reader of every daily newspaper. Parts of the descriptions correspond to the characteristics Mediana had already obtained with statistical methods, but some of the descriptions include very interesting personal characteristics. At first sight these rules seem unreasonable, but after some more careful analysis and discussions with the experts a perfectly reasonable explanation could be found. Those rules represent the most valuable results. We can observe that effect especially in EXAMPLE 9, where we describe the readers of the newspaper Slovenske novice.

While testing, we ran the algorithm on several groups of attributes which describe the person's life style, statements, activities, interests, and recognition of trademarks. Our priority was the comprehensibility of the trees. Since some newspapers have very few readers, we had to adapt the thresholds and parameters to get the right level of comprehensibility. We present the most promising trees in the form of rules. We took those rules which have an accuracy above a certain threshold (0.65). On this set of rules we based our interpretations, and each interpretation is based on several different rules. Examples 8-10 present several interpretations and, for each, one of the rules the interpretation is based on. On the right side of a rule is the class (True, False); the first number stands for the number of cases and the second number for the accuracy.

EXAMPLE 8: Description of readers of the daily newspaper Delo

The typical reader of Delo reads newspapers several times per week, has a higher level of education, recognizes certain trademarks of newspapers/magazines, cars, beers and washing powders, tracks information about manufacturers, shopping and inland news, and likes to watch TV and videocassettes. The results fit well with our intuitive presumption. Several rules which are the basis for the description above:

Interpretation: Reads newspapers several times per week.
Rule: if (reading daily newsp. at all > 5) and (tracking topical subjects at home > 3) and (recognition of trade marks of daily newspapers > 2) and (recognition of trade marks of daily newspapers <= 4) -> T (1051, 0.84)

Interpretation: Recognizes trademarks of newspapers/magazines, cars, beers, washing powders, ...
Rule: if (recognition of trade marks of daily newspapers > 2) and (recognition of trade marks of daily newspapers <= 4) -> T (2333, 0.672)

Interpretation: Higher level of education.
Rule: if (reading daily newsp. at all > 1) and (spend evenings at home > 1) and (interest in animal world > 1) and (recognition of trade marks of washing powder <= 40) and (recognition of trade marks of magazines <= 270) and (education > 5) -> T (242.9, 0.855)

Interpretation: Watches TV and videocassettes.
Rule: if (watching videocassettes <= 6) and (theater <= 4) and (inf. about manufacturers, shopping > 4) and (recognition of trade marks of daily newspapers > 0) and (recognition of trade marks of daily newspapers <= 4) and (time of getting up <= 11) -> T (108.8, 0.845)

Interpretation: Interested in inland news.
Rule: if (newspapers weekly read > 5) and (topical subjects at home > 3) -> T (1051, 0.84)

EXAMPLE 9: Description of readers of the daily newspaper Slovenske Novice

The typical reader of Slovenske Novice is a regular reader of newspapers/magazines and likes to sit in coffeehouses, bars and sweetshops. They recognize trademarks of newspapers/magazines and commercials for newspapers/magazines. They recognize fewer trademarks than readers of the Delo newspaper, and they also typically recognize different trademarks than readers of Delo. They also read Slovenski Delničar (a magazine that covers economic topics), Jana (a magazine tracking more feminine topics), and Kaj and Vroči Kaj (yellow press, erotic contents). If he or she speaks Croatian, then he or she also probably reads the Kaj magazine.

The most interesting statement is the one saying that readers of Slovenske Novice like to sit in coffeehouses, bars, and sweetshops. This rule looks very strange, but only at first sight. Slovenske Novice is namely the newspaper with the highest number of readers in Slovenia, while the Delo newspaper has the largest edition and Slovenske Novice the second largest. How is this possible? If we look closer at the bars, coffeehouses or sweetshops, in most cases we can find the Slovenske Novice newspaper on the table. So, when you are in this kind of place, besides drinking coffee or eating sweets, you also read Slovenske Novice. The statements concerning trademarks could be profitably used for marketing planning.

Interpretation: Regular readers of newspapers/magazines who like to remain sitting for a while in coffeehouses, bars, sweetshops, ...
Rule: if (reading daily newsp. weekly > 4) and (visiting coffeeshops more than weekly <= 6) and (interest in animal world > 1) and (reading SportNovosti == F) -> T (331, 0.889)

Interpretation: Recognize trademarks of newspapers/magazines, commercials for newspapers/magazines, ...
Rule: if (remembering commercials for daily newspaper > 0) and (recognition of trade marks of daily newspaper > 12) and (recognition of trade marks of daily newspaper <= 16) -> T (624, 0.887)

EXAMPLE 10: Description of readers of the daily newspaper Ekipa

Typical readers of Ekipa also read Sportske novosti (a newspaper which covers exactly the same topics as Ekipa, except that it is in the Croatian language), they visit sport events, are interested in motoring, and they like to experiment with novelties. These characteristics are intuitive if we know that the Ekipa newspaper is dedicated to sports, especially team sports. Unexpected were the characteristics that they are very tidy and that they mostly have one child between 7 and 14 years old.

Interpretation: Tidy. Like to experiment with novelties. Read Sportske novosti. Have at most one child between 7 and 14 years old.
Rule: if (importance of tidiness, cleanness > 3) and (trying new things > 2) and (new challenges <= 4) and (liking new things <= 3) and (children 14 years old <= 1) and (SportNovosti == T) -> T (71, 0.712)
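As a rough sketch of this kind of reader description, a CART tree from scikit-learn can stand in for C4.5/C5.0; the feature names and the synthetic target below are placeholders, and the leaf-size constraint mirrors the comprehensibility-oriented tuning described above.

```python
# Sketch: a small, comprehensibility-oriented decision tree separating readers of
# two newspapers. scikit-learn's CART is used as a stand-in for C4.5/C5.0, and the
# feature names are illustrative placeholders for questionnaire attributes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(1, 6, n),    # spareTime_Computers (1 = never ... 5 = very often)
    rng.integers(0, 2, n),    # settlementType_Urban (0/1)
    rng.integers(1, 6, n),    # mediaInterest_Politics
])
# Synthetic target: "reads Delo" made more likely for frequent computer users in urban areas.
y = ((X[:, 0] >= 4) & (X[:, 1] == 1)).astype(int)   # 1 = Delo, 0 = Slovenske novice

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=[
    "spareTime_Computers", "settlementType_Urban", "mediaInterest_Politics"]))
```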
All the examples until now deal with the question of describing the typical readers of a certain newspaper. How did we attack the other question, which tries to distinguish between the readers of the two largest newspapers in Slovenia? We selected only those readers who read either Delo or Slovenske Novice. The class for learning became which newspaper they read. The decision tree divided the readers into two distinctive groups. The tree is optimized for comprehensibility.

EXAMPLE 11: How do readers of Delo differ from readers of Slovenske Novice? See Figure 1.

Figure 1: Decision tree for readers of Slovenske Novice and Delo (the splits involve spare time spent on computers, settlement type (urban/rural), and interest in politics, travelogues, hazard games, and culture and history)

4.1.4 Clustering

One of our interests was also the identification of different groups of people and the description of their characteristics, regardless of their interest in the media. First, to determine how many distinctive groups could be found, we ran the Kohonen Network algorithm (as implemented in [2]). The result of the Kohonen Network was used as the K parameter in the K-means algorithm [7]. We ran the K-means algorithm on 4 groups of attributes (life viewpoints, media topics, spare time activities, and demographic properties like age, education, sex). As the result we got four clusters, consisting of 2550, 2124, 1385 and 1894 examples. For the description of each cluster's characteristics we used the C5 algorithm, with an additional constraint of 200 examples per leaf. Thus, each cluster represents a separate learning problem, where the target cluster (the cluster we try to describe) represents the positive class and the other three represent the negative class value. We got rather comprehensive trees with the following interpretations:

o The 1st group consists of people younger than 30 years who are not interested in topics like family, breeding and partnership, or younger than 20 years and interested in topics like exciting living, novelties, films, challenges, ... We could describe them in short as inspiring young people.
o The 2nd group consists of passive people; they don't like challenges, are not interested in entertainment, science, technology or economics, their main satisfaction is the family, and they like a life without major changes. They could be described as inactive older people.
o The 3rd group consists of people with a higher level of education, occupied by computers; mostly they are older than 30 years and listen to music; they are interested in most of the topics, have classic taste, are occupied by their children, promotions and novelties, challenges are important to them, and they follow the media quite often. We could describe them as ambitious people.
o The 4th group is formed by older people occupied by handicrafts; they are not interested in sport, but they are interested in most of the other topics, and they also like to get to know novelties, accepting them as a challenge. They can be referred to as active older people.
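A minimal sketch of the grouping step, with scikit-learn's KMeans and k = 4 simply taken as given (standing in for the Kohonen-then-K-means procedure described above); the feature matrix is a synthetic placeholder for the life-style, interest and demographic attributes.

```python
# Sketch: split respondents into k = 4 groups with K-means (k here is simply taken
# from the earlier Kohonen step), then report group sizes. The feature matrix is a
# synthetic placeholder for the life-style / interest / demographic attributes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.integers(1, 10, size=(8000, 12)).astype(float)   # 8000 questionnaires, 12 attributes

X_std = StandardScaler().fit_transform(X)                 # put graded answers on one scale
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)

# Cluster sizes; with the real data these corresponded to 2550, 2124, 1385 and 1894 people.
print(np.bincount(labels))
```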
Although most of the results impressed Mediana, the clustering results were the most exciting for them, since they had performed the same experiment on the same data with classical statistical analysis [3]. It took them several steps to define the criteria for the groups they were pleased with, and it took their time, resources and expert knowledge. Our results, which were practically the same, were gained in a very short time and without any prior expert knowledge.

5 Conclusions

In the paper we presented an experience with the data mining analysis of real world data describing the complete media space in Slovenia. Since the agency (the private institute Mediana) providing us with the data didn't specify the goals and tasks we should follow, we decided on a set of tasks that seemed interesting to us. The dataset is a collection of approx. 8000 questionnaires, each having approx. 1100 questions covering all kinds of topics about a questioned person's relationship to all important Slovenian media (newspapers, radio and TV) as well as his or her personal interests, lifestyle, social status etc. For the analysis phase we decided to use several techniques with the same main goal: to enable a deeper understanding of the dataset. Since the number of attributes was fairly large (above 1100), we decided first to find highly correlated groups of attributes to give some insight into the structure of the questionnaire. Next, we created association rules with the Apriori algorithm, discovering the relationships in reading habits for people reading more than one newspaper. Using decision tree learning we enlightened the personal characteristics of people reading certain newspapers, and finally, with clustering (K-Means), we split the people answering the questionnaires into several groups according to the attributes describing their personalities and lifestyle. Most of the results were very useful for the Mediana Institute in getting additional insights into their own data. Some of the results also have the potential to become additional commercial services offered by Mediana.

Acknowledgement

This work was supported by the EU project Sol-Eu-Net, IST-1999-11495, and the Slovenian Ministry of Science, Education, and Sport.

References

[1] Ian H. Witten, Eibe Frank (1999), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, and the implementation of the Weka system, Morgan Kaufmann.
[2] Clementine system (http://www.spss.com/clementine/).
[3] Jasna Zdovc (2000), Segmentation of audience by style (in Slovene), MSc Thesis, University of Ljubljana, Slovenia.
[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, Fast discovery of association rules. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.) Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, pp. 307-328, 1996.
[5] Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, Inc.
[6] T. Kohonen (1984), Self-Organization and Associative Memory, Springer-Verlag.
[7] Selim, S.Z. and Ismail, M.A. (1984), K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81-87.
[8] Arabie, P., and Hubert, L. (1995), Advances in cluster analysis relevant to marketing research. In From Data to Knowledge, W. Gaul and D. Pfeifer, Eds., Springer, Berlin, pp. 3-19.
[9] Leo Bogart (1989), Press and Public: who reads what, when, where, and why in American newspapers, Lawrence Erlbaum Associates, Publishers.
[10] "Women and Elections '99" and "Elections in Croatia 2000 - 20 % is (not) Enough", brochure of the Women's Information and Documentation Center, Zagreb, Croatia (http://www.zinfo.hr/engleski/research.htm).

Data mining of baskets collected at different locations over one year

Dunja Mladenić*+, William F. Eddy+, Scott Ziolko+
*J. Stefan Institute, Ljubljana, Slovenia
+Carnegie Mellon University, Pittsburgh, PA, USA
e-mail: Dunja.Mladenic@ijs.si, http://www-ai.ijs.si/DunjaMladenic/

Keywords: Data Mining, Meta Mining, market basket analysis, association rules, decision trees

Received: June 30, 2001

This paper describes the first steps in the analysis of millions of baskets collected over the past year from a retail grocery chain containing hundreds of stores. Each record in the data set represents an individual item processed by an individual checkout laser scanner at a particular store at a particular time on a particular day. In order to get some insights into the data, we used several different approaches including some statistical analysis, some machine learning, and some data mining methods. The sheer size of the data set has forced us to go beyond the usual data mining methods and utilize Meta-Mining: the post-processing of the results of basic analysis methods.

1 Introduction

We have obtained a data set from a retail grocery chain which contains all checkout scanner records for approximately one year. The data are delivered to the corporate headquarters on a weekly basis and are processed into a large corporate data warehouse at that time. We obtained a data feed from the in-house data processing activity. The corporate programs are written in COBOL and run on a large IBM mainframe computer. Thus, the records we obtain are in EBCDIC (rather than ASCII) and contain fields which are packed decimal and other non-standard formats. The data arrive weekly on an IBM 3590 cartridge tape which we insert into the Magstar tape library. The files are copied from tape onto the RAID storage and compressed (from 6GB to less than 2GB file size) so that they can be read on the Linux systems (which have a 2GB file size limit). We run our custom conversion program to convert the data from EBCDIC to ASCII and packed decimal to ASCII, etc. This produces one file for each hour of the week. These files are then sorted (by store, time, transaction number, etc.) to put all items in a market basket together. At this point the data are organized sufficiently for many different subsequent processing steps. Our standard processing generates a new file with one record per basket, listing the items in the basket on that record. We have other specialized projects which perform different processing on the basic sorted files.

2 Data Description

Our data set consists of about a year of data collected over several hundred supermarket stores having different sizes and locations. Each record in the data set represents an item that was scanned at one of the checkout stations at a given store, day, and time. For each record we have a number of fields giving details about the conditions under which it was sold, such as the price, information about potential price reductions applied (coupon sale, regular sale, ...), the department inside the store, the checkout scanner number, and the customer number.
There are a few million baskets each week and a total of several million customers who are registered as "loyal customers". Each item is associated with a Universal Product Code (UPC), and there is additional information about the items themselves given in a 4-level hierarchical taxonomy. The top level includes nearly 100 categories of items, such as bakery, dairy, frozen food, salads, seafood, wine, etc. The next level, giving a finer grouping of items, includes several hundred groups, such as bakery mixed, bakery needs, candy, deli, fresh bread & cake, juice drinks, lettuce, milk fresh, pet food, etc. The third level includes a couple of thousand subgroups such as fresh fish, frozen fish, other seafood, cake decor and food color, fruit snacks, carrots, peppers, tomatoes, other vegetables, pasta sauce, etc. The leaf level contains a couple of hundred thousand items with their UPC codes, such as cherry angel food cake (within cupcakes within cakes within bakery), Hanover whole baby carrots (within carrots within frozen vegetables within frozen), 48% vegetable oil spread (within margarine-bowls within margarine and butters within dairy), and "wht zinfandel" (within wine-misc within wine within alcohol beverage).

Because there are a couple of hundred thousand different items, it is useful to group them somehow. We found it extremely difficult to create groups by clustering the names and other methods, because the text descriptions do not provide common unique identification and it is sometimes difficult to group common products together. For example, there are 1909 entries in our database which contain the text string "MILK," including "MILKY WAY," "BUTTERMILK PANCAKES," etc. Of these, 291 contain the string "FRESH MILK." Of those, 49 contain the string "2%." Five of those contain the string "1/2%." Thus there are 44 items which correspond to 2% FRESH MILK coming from different suppliers, in different size containers, made of different materials. Because of these difficulties, in our work here we only use the third level of the taxonomy; all of our subsequent analyses are based on the couple of thousand subgroups.

Figure 1: Side-by-side boxplots showing the distributions of basket sizes for each hour of one day (10-06-00) across all stores. As expected, most items are purchased during the daytime (Hour 10 to 20, meaning 10 a.m. to 8 p.m.). Notice also that a considerable number of baskets have around 100 items.

The main unit we are dealing with is a basket, corresponding to the contents of a physical basket the customer presented at the counter. The number of baskets varies over hours and stores, and so does the number of items in an individual basket. It is interesting to see how the average basket size varies during the day. Figure 1 shows the distribution of the basket size over different hours of one day for all the stores. As expected, most items are purchased during the daytime (10 a.m. to 8 p.m.), where 25% of the baskets contain more than 10 items, with a considerable number of baskets having around 100 items. There are also some outliers with over 150 items, even one basket with about 200 items that was purchased around midnight (Hour 0 in the graph). All these outliers potentially reflect noise or error in the processed data, and some simple statistical processing can help in identifying such situations.
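A small sketch of the per-hour basket-size summary behind Figure 1, assuming pandas and a one-record-per-basket table with hypothetical column names:

```python
# Sketch: summarize basket sizes per hour of day, as in Figure 1. The column names
# of the one-record-per-basket file (store, hour, n_items) are placeholders.
import pandas as pd

baskets = pd.DataFrame({
    "store":   [12, 12, 47, 47, 47],
    "hour":    [10, 10, 17, 17, 0],
    "n_items": [3, 25, 8, 102, 1],
})

per_hour = baskets.groupby("hour")["n_items"].describe(percentiles=[0.25, 0.5, 0.75])
print(per_hour)          # count, mean, quartiles and max of basket size for each hour

# Flag suspiciously large baskets (e.g. the roughly 200-item basket around midnight).
print(baskets[baskets["n_items"] > 100])
```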
3 Decision Trees

Decision trees have often been used in data mining tasks such as finding cross-selling opportunities, performing promotion analysis, analyzing credit risk or bankruptcy, and detecting fraud. We use the C5.0 decision tree algorithm (an extension of C4.5 proposed by Quinlan [8]), deriving a rule set from a decision tree. We tried to predict the size of a basket based on several characteristics of the transaction, namely: the basket contents, the particular store number (this is an arbitrary corporate designation for the store), and the time. We constructed rules predicting basket size within broad categories (Very Small = 1-3, Small = 4-10, Medium = 11-20, Large = 21-40, Very Large = 41 or more). For clarity we have rewritten some of the rules generated by the program.

Figure 2 shows some of the rules derived (with pruning set to severity 75 and at least 10 examples per node) from the tree constructed for the hour with the smallest number of baskets (4 A.M. to 5 A.M.) on an April Monday. The frequency counts of the five basket size categories are, respectively, (381, 84, 15, 3, 1). Since this hour is in the middle of the night, most baskets are very small and the rules reflect this fact. The most significant rule notes that non-loyal customers have very small basket sizes. The remaining rules show that the three subgroups ICE CREAM & DESSERTS, WHITE SLICED, and ENTREE-SIDE DISHES are important for distinguishing between small and very small baskets in the middle of the night.

Rules for Very Small Baskets (all rules):
if not loyal customer then Very Small (275, 0.921)
if not (ICE CREAM & DESSERTS, WHITE SLICED, ENTREE-SIDE DISHES) then Very Small (445, 0.83)

Rules for Small Baskets (all rules):
if loyal customer and WHITE SLICED then Small (11, 0.692)
if loyal customer and ICE CREAM & DESSERTS then Small (11, 0.538)
if not ICE CREAM & DESSERTS and ENTREE-SIDE DISHES then Small (10, 0.5)

Figure 2: Some of the decision tree rules for the hour from 4 A.M. to 5 A.M. The two numbers in brackets show the number of baskets covered by the rule and the fraction of baskets for which the rule holds.

The rules for the hour with the highest number of baskets on the same day (5 P.M. to 6 P.M.) are given in Figure 3. The frequency counts of the five basket size categories are, respectively, (20179, 12855, 6481, 3928, 1195). The numbers of rules for the same five categories are (5, 32, 93, 114, 33). The most obvious feature of these is that they contain a considerably larger number of clauses. One rule, which we have omitted from the figure, predicts very small basket size for loyal customers if the basket contains EGGS and does not contain any of 39 other specific subgroups. We looked in more detail at the very small baskets and found 11,931 baskets with one item, 4,691 with two items, and 3,557 with three. The most frequent items in the one-item baskets (in descending frequency) were SNACK BAR, HOT PREPARED FOOD, EGGS, REGULAR COLAS, BREADS, 2% MILK, BABY FORMULA-LIQUID, etc. The most frequent pairs of subgroups in the two-item baskets were HOT PREPARED FOODS (twice), FILM COUPON and FRONT RACK ITEM, and MILK (twice) (film is usually displayed in the front racks). The most frequent triple of subgroups was BABY FORMULA-LIQUID, BABY FOOD-CEREALS, BABY FOOD-JUICES, which only occurred 12 times.
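A short sketch of how such frequency summaries of the smallest baskets can be obtained with plain Python; the subgroup names are placeholders.

```python
# Sketch: most frequent subgroups in one-item baskets and most frequent subgroup
# pairs in two-item baskets. Baskets are lists of third-level subgroup names.
from collections import Counter
from itertools import combinations

baskets = [
    ["SNACK BAR"], ["EGGS"], ["SNACK BAR"],
    ["HOT PREPARED FOODS", "HOT PREPARED FOODS"],
    ["FILM COUPON", "FRONT RACK ITEM"],
]

single = Counter(b[0] for b in baskets if len(b) == 1)
pairs = Counter(tuple(sorted(c)) for b in baskets if len(b) == 2
                for c in combinations(b, 2))

print(single.most_common(3))   # e.g. [('SNACK BAR', 2), ('EGGS', 1)]
print(pairs.most_common(3))
```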
Rules for Medium (a subset of 93 rules):
if LowerBound < StoreNo <= UpperBound and loyal customer and LUNCHMEAT and CAT FOOD-WET > 3 and not (MEATS DELI, PASTA, BABY FOOD-CEREALS-COOKIES, EGGS, POTATO CHIPS, POTATOES & ONIONS) then Medium (21, 0.826)
if time <= 17:20:00 and loyal customer and (SHREDDED CHEESE, POTATOES & ONIONS) and not (MEATS DELI, PASTA, YOGURT, EGGS, BACON, UNK, BABY FOOD-CEREALS-COOKIES, CONVENIENCE FOODS) then Medium (21, 0.826)

Rules for Large (a subset of 114 rules):
if (REGULAR COLAS, EGGS, 2%, SOUR CREAM) and not (CONDENSED CANNED SOUPS, BACON) then Large (31, 0.667)
if loyal customer and (MEATS DELI, EGGS, POTATOES & ONIONS, BACON) and not (CANNED CHUNK LITE TUNA, PASTA, POLY BAG POTATOES, PAPER TOWELS) then Large (61, 0.635)

Rules for Very Large (a subset of 32 rules):
if (YOGURT, EGGS, POTATOES & ONIONS, UNK > 3) then Very Large (48, 0.9)
if (REGULAR COLAS, EGGS, BACON, LUNCHMEAT) then Very Large (77, 0.722)

Figure 3: Some of the decision tree rules for the hour from 5 P.M. to 6 P.M. Notice that UNK is used for Unknown.

It is interesting to note that some of the rules use the (arbitrary) store number to predict basket size and one rule uses the time of day. Several of the rules apply only to loyal customers.

4 Association Rules

In order to find associations between the items in the data, we used association rules.
We generated association rules on 18 weeks of data collected in the year 2000, from mid April through mid October. A few weeks are missing due to technical difficulties in the first phase of data processing, needed to transfer the data from the original tape format to our computers. Some of the rules repeat in a different number of weeks; Figure 4 shows the number of different triples of items plotted against the number of triple repetitions. For each week the top 3000 highly ranked rules were selected, meaning that we selected 54,000 rules over the 18 weeks. Since many of the rules repeat in different weeks, there are 5856 different rules in the union of all the highly ranked rules of all 18 weeks. In these rules, 1952 different triples of items occur; 1059 of them occur in exactly one week, 377 repeat in exactly two weeks, etc. (see Figure 4). Notice that the number of rules is exactly three times the number of triples, since each triple occurs in three distinct rules, each having one of the triple items on the left side of the rule and the remaining two items on the right side. The reason is that the minimum confidence used by the rule generation algorithm was set to a very low value (0.1%). There are 780 rules with 260 different triples of items that repeat in the 3000 highly ranked rules of all the weeks. For illustration, we show some of these rules in Table 1. Some of the rules are formed from a frequent pair, such as spaghetti sauce and pasta or shampoos and conditioner-rinses, combined with different frequent single items, such as bananas or tomatoes. The same rule has a different support in different weeks.

Figure 4: Number of repeating triples of items over the number of repetitions of the triple.

5 Discussion

Most data mining approaches concentrate on extracting interesting properties directly from the collected data. There are different proposals on how to post-process the results in order to focus only on interesting properties. Another way of post-processing is mining the results of other data mining algorithms, i.e. applying meta-mining; for instance, finding spatio-temporal information by mining the induced rules [1]. Inferring higher order rules from association rules, as proposed in [7], involves applying time series analysis to the support and confidence of an association rule that repeats over time. This assumes that we have enough data collected over time and that the same association rules repeat over time. In our data, we found 260 triples repeating over all the weeks.
The following four rules repeated in 86% of the stores in 9 consecutive weeks (15-May through 10-July):
- (RTE KIDS or RTE WHOLESOME FAMILY) and SPAGHETTI SAUCE and PASTA
- SPAGHETTI SAUCE and PASTA and TOILET TISSUE
- SPAGHETTI SAUCE and PASTA and TOMATOES
- MEXICAN FOODS and SHREDDED CHEESE and TOMATOES

If we relax the restriction slightly, so that a rule needs to appear in at least 86% of the stores in 8 of the 9 weeks (15-May through 10-July), we get 48 rules such as:
- MEATS DELI and LETTUCES and TOMATOES
- TOILET TISSUE and PAPER TOWELS and FACIAL TISSUES
- TOILET TISSUE and PAPER TOWELS and SOAPS - HAND & BATH
- VEGETABLES and CARROTS and PEPPERS

The most complete coverage is achieved by the two rules that appear in 96.5% of the stores in all 9 weeks. These rules are:
- GROUND BEEF and SPAGHETTI SAUCE and PASTA
- VEGETABLES and LETTUCES and TOMATOES

By applying a simple intersection to the sets of the most "interesting" rules found for different weeks and different stores (a minimal sketch of this step follows the reference list), we additionally reduced the number of rules a person may want to check when searching for "stable rules" that repeat over time and/or over different locations (stores). More sophisticated methods are needed to obtain additional information from the generated rules, such as clusters of similar stores or time series analysis of the support or confidence of selected rules. These are part of our ongoing work investigating items in our weekly growing database of transactions.

Acknowledgement

The work reported in this paper was supported by the Slovenian Ministry of Education, Science and Sport.

References

[1] Abraham, T. and Roddick, J. F. (1998), Opportunities for knowledge discovery in spatio-temporal information systems, Australian Journal of Information Systems 5(2), pp. 3-12.
[2] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. (1996), Fast discovery of association rules. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, pp. 307-328.
[3] Brin, S., Motwani, R., and Silverstein, C. (1997), Beyond Market Baskets: Generalizing Association Rules to Correlations, In Proceedings of the ACM Conference on Management of Data (SIGMOD-97), pp. 265-276, Tucson, Arizona, USA.
[4] Borgelt, C., Apriori, http://fuzzy.cs.uni-magdeburg.de/~borgelt/.
[5] DuMouchel, W. and Pregibon, D. (2001), Empirical Bayes Screening for Multi-Item Associations, In Proc. KDD 2001, ACM.
[6] Grobelnik, M. and Mladenič, D. (1998), Learning Machine: design and implementation, Technical Report IJS-DP-7824, Department of Intelligent Systems, J. Stefan Institute, Slovenia, January 1998.
[7] Spiliopoulou, M. and Roddick, J. F. (2000), Higher Order Mining: Modelling and Mining the Results of Knowledge Discovery, Proc. 2nd Int. Conference on Data Mining Methods and Databases (Ebecken, N. and Brebbia, C. A., Eds.), WIT Press, pp. 309-320.
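For completeness, here is the minimal intersection sketch referred to above. The per-week sets are invented placeholders; in the paper each set would contain the triples behind the top 3000 ranked rules of one week, and an analogous intersection over stores (or a counting variant requiring presence in a given fraction of weeks or stores) yields the "stable" rules.

# A minimal sketch, assuming each week's top-ranked rule triples are
# available as a set. The sets below are invented placeholders.
weekly_top_triples = [
    {("GROUND BEEF", "SPAGHETTI SAUCE", "PASTA"),
     ("VEGETABLES", "LETTUCES", "TOMATOES"),
     ("SHAMPOOS", "CONDITIONER-RINSES", "BANANAS")},
    {("GROUND BEEF", "SPAGHETTI SAUCE", "PASTA"),
     ("VEGETABLES", "LETTUCES", "TOMATOES")},
    {("GROUND BEEF", "SPAGHETTI SAUCE", "PASTA"),
     ("VEGETABLES", "LETTUCES", "TOMATOES"),
     ("TOILET TISSUE", "PAPER TOWELS", "SOAPS - HAND & BATH")},
]

# "Stable" triples are those present in the top-ranked set of every week.
stable = set.intersection(*weekly_top_triples)
print(stable)

Requiring a triple to appear in, say, 8 of 9 weeks rather than in all of them is a small variation obtained by counting occurrences per triple instead of intersecting.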
- BAKING CAKE-BROWNIE-COOKIE <- BAKING READY-2-SPREAD FROSTING and EGGS (0.11%, 71%)
- EGGS <- BAKING READY-2-SPREAD FROSTING and BAKING CAKE-BROWNIE-COOKIE (0.11%, 32%)
- BAKING READY-2-SPREAD FROSTING <- BAKING CAKE-BROWNIE-COOKIE and EGGS (0.11%, 24%)
- SPAGHETTI SAUCE <- PASTA and FROZEN POULTRY (0.27%, 55%)
- PASTA <- SPAGHETTI SAUCE and FROZEN POULTRY (0.27%, 52%)
- FROZEN POULTRY <- SPAGHETTI SAUCE and PASTA (0.27%, 11%)
- PASTA <- SPAGHETTI SAUCE and TOMATOES (0.67%, 47%)
- SPAGHETTI SAUCE <- PASTA and TOMATOES (0.67%, 44%)
- TOMATOES <- SPAGHETTI SAUCE and PASTA (0.67%, 28%)
- PAPER TOWELS <- TOILET TISSUE and PAPER NAPKINS (0.22%, 50%)
- TOILET TISSUE <- PAPER TOWELS and PAPER NAPKINS (0.22%, 45%)
- PAPER NAPKINS <- TOILET TISSUE and PAPER TOWELS (0.22%, 14%)
- SHAMPOOS <- CONDITIONER-RINSES and BANANAS (0.14%, 61%)
- BANANAS <- SHAMPOOS and CONDITIONER-RINSES (0.14%, 28%)
- CONDITIONER-RINSES <- SHAMPOOS and BANANAS (0.14%, 26%)
- VEGETABLES <- LETTUCES and CARROTS (0.81%, 68%)
- LETTUCES <- VEGETABLES and CARROTS (0.81%, 38%)
- CARROTS <- LETTUCES and VEGETABLES (0.81%, 26%)

Table 1: Example rules that repeat in all 18 weeks if we generate association rules ignoring the store information. We show all three rules for the selected triples of items. For each rule we give its support and confidence from the mid-May week. Support shows the fraction of baskets containing all the rule items, and is therefore the same for all three rules containing the same triple of items. Confidence shows the proportion of baskets containing the pair of items on the right side of the rule that also contain the item on the left side of the rule.

[8] Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, Inc.

Multimedia supported study of achieving high worker's efficiency in relation to his work

Zvone Balantič and Mojca Bernik
University of Maribor, Faculty of Organizational Sciences
Kidričeva 55a, SI-4000 Kranj, Slovenia
zvone.balantic@fov.uni-mb.si, mojca.bernik@fov.uni-mb.si

Keywords: electronic publication, multimedia, education, ergonomics

Received: November 15, 2001

Collaboration among experts in electronics, engineering and medicine as a rule requires that each expert has some knowledge of the other fields in order to produce a useful application. Ergonomics and work process management are the goals achieved by such an approach. A comprehensive analysis of the work process, the environmental factors that influence it, and the output measured as a function of these influences is the cornerstone of this multimedia CD compilation, presented in both a textbook and a handbook format. In the electronic publication, study contents are combined with the practical experience of efficient Slovenian companies that were prepared to present a piece of their experience.

1 Introduction

The interactive electronic publication MAN-WORK-EFFICIENCY is prepared in the form of a CD containing pedagogical and related scientific and research elements of the analysis and synthesis of working systems. Through a contemporary mode of access, enabled by today's information technology, we tried to additionally motivate independent and individual study that is open to discussion and research. On the whole, the CD is divided into two parts (a theoretical and a practical one), which are dynamically linked and interwoven. Corresponding points in the contents provide the link between the two parts, so that the theoretical part is linked to the practical one and vice versa.
Today, all disciplines are increasingly interconnected and, in this sense, they are exploring new and interesting ways that usually lead towards entirely new dimensions of thinking. Interdisciplinary themes cannot resist mapping a complicated system onto a related one that is mathematically and algorithmically easier to manage. In this process we are essentially helped by up-to-date information technology, which streamlines procedures that would otherwise demand much more laborious work. Among the individual disciplines there is an intertwined net of information, and because mastering a single specialty remains confined to the inner field of that profession, a separate science called ergonomics was established, for example, for the interaction between man and machine. This field is very important because in any human activity we aim to attain optimal results by taking into consideration all the principles of human work [1]. The study of this scientific discipline is an interdisciplinary one, since at each step it reaches into the fields of ergology, medicine, biomechanics, anthropology, physiology, psychology, sociology, environmental studies, work organization, systems theory, technology, techniques and industrial design [2]. We can say that, with regard to performance qualities, humans and computers complement each other rather than compete. Our expectations of what information technology can do should therefore be based on human and machine complementarity, and directed towards active support of mental processes [3]. Given such an extensive and ramified subject, it was clear that some parts of the scientific literature had to be united on an integral basis of knowledge. In this process we could be helped by a multimedia approach.

2 Working methods

The electronic publication was produced in the Microsoft Windows 98 environment by means of the Microsoft PowerPoint 97 tool. The content can be viewed and interactively communicated with through the Microsoft PowerPoint Viewer 97 application. In preparing the CD we kept to didactic and pedagogical principles. The electronic publication was first of all conceived as a manual with theoretical and educational contents meant for everyone who deals with the analysis of working processes, from work organizers and safety engineers to medical doctors specialized in occupational medicine. Besides the above, the CD is also a manual providing instructions for the introduction to research projects in the field of designing healthy working environments. It contains a series of practical examples from companies where working processes are already ergonomically designed and which serve as a help and a model. Indeed, this perspective gives us a superstructure of the acquired knowledge and represents applied solutions from different fields of the elaborated topics.

Figure 1: Electronic publication MAN-WORK-EFFICIENCY

The electronic publication MAN-WORK-EFFICIENCY can be used in the pedagogical process (in doing exercises - comprehension of the contents; working in the laboratory - solutions that are suitable for applications), in designing the working process at a workplace, in research projects (methods of measuring, ...), in encyclopedic refreshment of single topics (chapter structure), etc.
A "rounded book" is meant for students and professors in the field of organizing working processes, while in the field of direct application it is meant for top employees of companies and related production units, safety engineers, system theorists, consultants, design engineers, technologists, etc.

[Screenshot of the CD interface: main menu ČLOVEK - DELO - UČINEK (MAN - WORK - EFFICIENCY) with the chapters 1. Ergonomic fundamentals, 2. Information flows (man - machine), 3. Biomechanical fundamentals, 4. Energy consumption at work, 5. The senses, 6. Microclimatic environment, together with sub-topics on receptors, the light spectrum and the eye.]