Informatica
An International Journal of Computing and Informatics

Special Issue: Theoretical Computer Science
Guest Editor: Andrej Brodnik

The Slovene Society Informatika, Ljubljana, Slovenia
Dear reader,

In front of you is the final act of the very successful conference Theoretical Computer Science 2004 (TCS), which took place as a sub-conference of the multiconference Information Society 2004 on October 9-15, 2004. The program committee of TCS received 23 contributions from 8 countries. After a thorough reviewing process it selected 12 papers to be presented at the conference. In addition, since the program committee wanted to bring the TCS conference closer to the general public, it decided to invite 10 contributions to a special poster session. Besides these talks, the conference also featured an invited talk entitled Succinct Data Structures, given by Prof. Ian Munro from the University of Waterloo, Canada.

The members of the program committee were:
- Andrej Brodnik, University of Primorska, Chairman
- Stefano Crespi-Reghizzi, Technical University Milano
- Roberto Grossi, University of Pisa
- Marjan Mernik, University of Maribor
- Bojan Mohar, University of Ljubljana
- Peter Paule, Research Institute for Symbolic Computation (RISC), Linz
- Marko Petkovšek, University of Ljubljana
- Tomaž Pisanski, University of Ljubljana
- Borut Robič, University of Ljubljana
- John Shawe-Taylor, University of Southampton
- Boštjan Vilfan, University of Ljubljana
- Gerhard J. Woeginger, University of Twente
- Janez Žerovnik, University of Maribor

The program committee also received an invitation to publish the best papers from the conference as a special part of the Informatica journal. The committee decided to invite the authors of four contributions:
- Miklós Bartha and Miklós Krész, Deterministic soliton graphs,
- Sergio Cabello, Matt DeVos and Bojan Mohar, Expected case for projecting points,
- Hovhannes A. Harutyunyan and Calin D. Morosan, The spectra of Knödel graphs, and
- Bojan Mohar, On the crossing number of almost planar graphs.

The authors were asked to thoroughly review their papers and extend them for journal publication. The rewritten papers were reviewed once more, and now they are in front of you. All four papers deal with graphs and their use in solving various problems.

In the first paper Bartha and Krész study soliton graphs. Soliton graphs are related to deterministic automata, and the authors show how and when they can be reduced to simpler, more regular structures (chestnut graphs, generalized trees, and graphs having a unique perfect matching) without affecting their properties.

Cabello et al. in the second paper consider a set of n points in the plane in which the distance between any pair of points is at least one. They project these points onto a random line, which is split into segments (cells) of length one - such a line is called a graduated line. In the paper they prove an upper bound of O(n^{2/3}) on the expected concentration of the projections on a graduated line. Their result is relevant in computational geometry for sweep-line algorithms in which the sweeping direction is chosen at random.

In the third paper Harutyunyan and Morosan study Knödel graphs. Knödel graphs are applicable in distributed computing, as they can be used for data broadcasting. An important property of Knödel graphs is their spectrum, and the authors show how to compute the spectra of Knödel graphs using results from Fourier analysis, circulant matrices and PD-matrices.
From these results they derive a formula for the number of spanning trees, each of which can be used to broadcast data in the graph.

In the last paper Bojan Mohar answers a question posed by Riskin: whether the crossing number of a graph G_0 + xy equals d, where G_0 is a 3-connected cubic planar graph and x, y ∈ V(G_0) are at dual distance d. The answer is negative, and it remains negative even for 5-connected planar graphs.

Andrej Brodnik

Deterministic Soliton Graphs

Miklós Bartha
Department of Computer Science, Memorial University of Newfoundland, St. John's, NL, Canada
E-mail: bartha@cs.mun.ca

Miklós Krész
Department of Computer Science, Juhász Gyula Teacher Training College, University of Szeged, Hungary
E-mail: kresz@jgytf.u-szeged.hu

Keywords: soliton automata, matching theory

Received: March 23, 2005

Soliton graphs are studied in the context of a reduction procedure that simplifies the structure of graphs without affecting the deterministic property of the corresponding automata. It is shown that an elementary soliton graph defines a deterministic automaton iff it reduces to a graph not containing even-length cycles. Based on this result, a general characterization is given for deterministic soliton graphs using chestnut graphs, generalized trees, and graphs having a unique perfect matching.

Povzetek: The paper deals with graphs without odd cycles.

1 Introduction

One of the most ambitious goals of research^1 in modern bioelectronics is to develop a molecular computer. The introduction of the concept "soliton automaton" in [5] was inspired by this research, with the intention to capture the phenomenon called soliton waves [4] through an appropriate graph model. Soliton graphs and automata have been systematically studied by the authors on the grounds of matching theory in a number of papers. Perhaps the most significant contribution among these is [2], where soliton graphs are decomposed into elementary components, and these components are grouped into pairwise disjoint families based on how they can be reached by alternating paths starting from external vertices. This paper can also serve as a source of further references on soliton automata for the interested reader.

Since soliton automata are proposed as switching devices, deterministic automata are at the center of the investigations. The results reported in this paper aim at providing a complete characterization of deterministic soliton automata. The two major aspects of this characterization are:
1. Describing elementary deterministic soliton graphs.
2. Recognizing that deterministic soliton graphs having an alternating cycle follow a simple hierarchical pattern called a chestnut.

^1 Partially supported by the Natural Science and Engineering Research Council of Canada, Discovery Grant #170493-03.

An important tool in the study of both aspects is a reduction procedure, which might be of interest by itself.
It allows elementary deterministic soliton graphs to be reduced to generalized trees, and it can also be used to reduce chestnut graphs to really straightforward ones, called baby chestnuts.

2 Soliton graphs and automata

By a graph G = (V(G), E(G)) we mean an undirected graph with multiple edges and loops allowed. A vertex v ∈ V(G) is called external if its degree is one, and internal if the degree of v is at least two. An internal vertex is base if it is adjacent to an external one. External edges are those of E(G) that are incident with at least one external vertex, and internal edges are those connecting two internal vertices. Graph G is called open if it has at least one external vertex.

A walk in a graph is an alternating sequence of vertices and edges, which starts and ends with a vertex, and in which each edge is incident with the vertex immediately preceding it and the vertex immediately following it. The length of a walk is the number of occurrences of edges in it. A trail is a walk in which all edges are distinct, and a path is a walk in which all vertices are distinct. A cycle is a trail which can be decomposed into a path and an edge connecting the endpoints of the path. If α = v e_1 ... e_n w is a walk from v to w and β = w f_1 ... f_k z is a walk from w to z, then the concatenation of α and β is the walk α · β = v e_1 ... e_n w f_1 ... f_k z from v to z.

A matching M of graph G is a subset of E(G) such that no vertex of G occurs more than once as an endpoint of some edge in M. It is understood by this definition that loops cannot participate in any matching. The endpoints of the edges contained in M are said to be covered by M. A perfect internal matching is a matching that covers all of the internal vertices. An edge e ∈ E(G) is allowed (mandatory) if e is contained in some (respectively, all) perfect internal matching(s) of G. Forbidden edges are those that are not allowed. We shall also use the term constant edge to identify an edge that is either forbidden or mandatory. An open graph having a perfect internal matching is called a soliton graph. A soliton graph G is elementary if its allowed edges form a connected subgraph covering all the external vertices. Observe that if G is elementary, then it cannot contain a mandatory edge, unless G is a mandatory edge by itself with a number of loops incident with one of its endpoints.

Let G be an elementary soliton graph, and define the relation ∼ on Int(G) as follows: v_1 ∼ v_2 if an extra edge e connecting v_1 with v_2 becomes forbidden in G + e. It is known, cf. [6, 2], that ∼ is an equivalence relation, which determines the so-called canonical partition of (the internal vertices of) G. The reader is referred to [6] for more information on canonical equivalence, and on matching theory in general.

Let G be a graph and M be a matching of G. An edge e ∈ E(G) is said to be M-positive (M-negative) if e ∈ M (respectively, e ∉ M). An M-alternating path (cycle) in G is a path (respectively, even-length cycle) stepping on M-positive and M-negative edges in an alternating fashion. An M-alternating loop is an odd-length cycle having the same alternating pattern of edges, except that exactly one vertex has two negative edges incident with it. Let us agree that, if the matching M is understood or irrelevant in a particular context, then it will not be explicitly indicated in these terms. An external alternating path is one that has an external endpoint. If both endpoints of the path are external, then it is called a crossing.
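The definitions above are easy to make concrete. Here is a minimal brute-force sketch (our illustration, not part of the paper; the three-edge example graph is made up) that enumerates the perfect internal matchings of a small open graph:

```python
from itertools import combinations

def perfect_internal_matchings(edges, internal):
    """Enumerate all perfect internal matchings of an open graph.

    edges: list of 2-tuples (u, v); internal: set of internal vertices.
    A matching is a set of edges with no shared endpoint (loops are
    excluded by definition); it is a perfect internal matching if it
    covers every internal vertex.
    """
    candidates = [e for e in edges if e[0] != e[1]]   # loops never match
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            covered = [v for e in subset for v in e]
            if len(covered) != len(set(covered)):     # shared endpoint
                continue
            if internal <= set(covered):              # covers all internals
                yield set(subset)

# a path x - a - b - y with external endpoints x, y (hypothetical example)
edges = [("x", "a"), ("a", "b"), ("b", "y")]
for M in perfect_internal_matchings(edges, internal={"a", "b"}):
    print(M)   # {('a','b')} and {('x','a'),('b','y')}; every edge is allowed
```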
An alternating path is positive if it is such at its internal endpoints, meaning that the edges incident with those endpoints are positive.

Let G be a soliton graph, fixed for the rest of this section, and let M be a perfect internal matching of G. An M-alternating unit is either a crossing or an alternating cycle with respect to M. Switching on an alternating unit amounts to changing the sign of each edge along the unit. It is easy to see that the operation of switching on an M-alternating unit α creates a new perfect internal matching S(M, α) for G. Moreover, as was proved in [1], every perfect internal matching M of G can be transformed into any other perfect internal matching M' by switching on a collection of pairwise disjoint alternating units. Consequently, an edge e of G is constant iff there is no alternating unit passing through e with respect to any perfect internal matching of G. A collection of pairwise disjoint M-alternating units will be called an M-alternating network, and the network transforming one perfect internal matching M into another M' will be denoted by N(M, M'). Clearly, N(M, M') is unique.

Now we generalize the alternating property to trails and walks. An alternating trail is a trail α stepping on positive and negative edges in such a way that α is either a path, or it returns to itself only in the last step, traversing a negative edge. The trail α is a c-trail (l-trail) if it does return to itself, closing up an even-length (respectively, odd-length) cycle. That is, α = α_1 · α_2, where α_1 is a path and α_2 is a cycle. These two components of α are called the handle and circuit, in notation h(α) and c(α). The joint vertex on h(α) and c(α) is called the center of α. An external alternating trail is one starting out from an external vertex, and a soliton trail is a proper external alternating trail, that is, either a c-trail or an l-trail. See Fig. 1.

Figure 1: Soliton trails: a) c-trail, b) l-trail.

The collection of external alternating walks in G with respect to some perfect internal matching M, and the concept of switching on such walks, are defined recursively as follows.

(i) The walk α = v_0 e v_1, where e = (v_0, v_1) with v_0 being external, is an external M-alternating walk, and switching on α results in the set S(M, α) = M Δ {e}. (The operation Δ is the symmetric difference of sets.)

(ii) If α = v_0 e_1 ... e_n v_n is an external M-alternating walk ending at an internal vertex v_n, and e_{n+1} = (v_n, v_{n+1}) is such that e_{n+1} ∈ S(M, α) iff e_n ∉ S(M, α), then α' = α e_{n+1} v_{n+1} is an external M-alternating walk and S(M, α') = S(M, α) Δ {e_{n+1}}. It is required, however, that e_{n+1} ≠ e_n, unless e_n ∈ S(M, α) is a loop.

It is clear by the above definition that S(M, α) is a perfect internal matching iff the endpoint v_n of α is external, too. In this case we say that α is a soliton walk.

Example. Consider the graph G of Fig. 2, and let M = {e, h_1, h_2}. A possible soliton walk from u to v with respect to M is α = u e w g z_1 h_1 z_2 l_2 z_3 h_2 z_4 l_1 z_1 g w f v. Switching on α then results in S(M, α) = {f, l_1, l_2}.

Figure 2: Example soliton graph G.

Graph G gives rise to a soliton automaton A_G, the states of which are the perfect internal matchings of G. The input alphabet for A_G is the set of all (ordered) pairs of external vertices in G, and the state transition function δ is defined by

δ(M, (v, w)) = {S(M, α) | α is a soliton walk from v to w}.

Graph G is called deterministic if A_G is such in the usual sense, that is, if for every state M and input (v, w), |δ(M, (v, w))| ≤ 1.
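The recursive definition lends itself to direct implementation. The sketch below (ours, under the assumption that the triple (current vertex, running matching, last edge) determines all legal continuations, which is exactly what rules (i) and (ii) use) computes δ(M, (v, w)) by breadth-first search over walk states; the state space is finite even though soliton walks may repeat edges:

```python
from collections import deque

def soliton_delta(edges, external, M, v, w):
    """delta(M, (v, w)) = { S(M, alpha) : alpha a soliton walk from v to w }.

    edges: list of pairs (u, x); parallel edges are distinct entries and a
    loop is a pair (u, u).  M: frozenset of edge indices (a perfect internal
    matching).  external: set of external vertices.  A walk is summarized by
    (current vertex, running matching S(M, alpha), last edge index); rules
    (i) and (ii) depend only on this triple, and there are finitely many
    triples, so a BFS with a visited set enumerates all soliton walks.
    """
    incident = {}
    for i, (a, b) in enumerate(edges):
        incident.setdefault(a, []).append(i)
        if a != b:
            incident.setdefault(b, []).append(i)

    def other_end(i, u):
        a, b = edges[i]
        return b if u == a else a

    results, seen = set(), set()
    queue = deque((other_end(i, v), M ^ {i}, i) for i in incident.get(v, ()))  # rule (i)
    while queue:
        u, S, last = queue.popleft()
        if (u, S, last) in seen:
            continue
        seen.add((u, S, last))
        if u in external:                 # the walk is a soliton walk here
            if u == w:
                results.add(S)
            continue                      # walks extend only at internal vertices
        for i in incident[u]:             # rule (ii): e_{n+1} in S iff e_n not in S
            if (i in S) == (last in S):
                continue
            is_loop = edges[last][0] == edges[last][1]
            if i == last and not (last in S and is_loop):
                continue
            queue.append((other_end(i, u), S ^ {i}, i))
    return results
```

Determinism of A_G can then be checked by verifying that len(soliton_delta(...)) ≤ 1 for every state M and every ordered pair (v, w) of external vertices.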
Example (continued). Observe that the soliton automaton defined by the graph of Fig. 2 is non-deterministic, as α' = u e w f v is also a soliton walk from u to v with respect to state M, and S(M, α) ≠ S(M, α').

Let α be a soliton c-trail with respect to M. It is easy to see that the walk s(α) = h(α) · c(α) · h(α)^R is a soliton walk, and the effect of switching on s(α) is the same as switching on the cycle c(α) alone. (For any walk β, β^R denotes the reverse of β.) If α is a soliton l-trail, then s(α) is defined as the soliton walk s(α) = h(α) · c(α) · c(α) · h(α)^R. Clearly, this walk induces a self-transition of A_G, that is, no state change is observed. In the sequel, all perfect internal matchings of G will simply be called states, for obvious reasons.

Recall from [5] that an edge e of G is impervious if there is no soliton walk passing through e in any state of G. Edge e is viable if it is not impervious. See Fig. 2, edge h, for an example of an impervious edge. We are going to give a simpler characterization of impervious edges in terms of alternating paths, rather than walks. To this end, we need the following lemma.

Lemma 2.1. If α is an external M-alternating walk from v to u, then there exists an M-alternating network F and an external M-alternating trail β from v to u such that
1. F consists of cycles only, and it is disjoint from β;
2. S(M, α) = S(S(M, F), β).

Proof. Easy induction on the length of α, left to the reader. □

An internal vertex v ∈ V(G) is called accessible with respect to state M if there exists a positive external M-alternating path leading to v. It is easy to see, cf. [2], that vertex v is accessible with respect to some state M iff v is accessible with respect to all states of G.

Proposition 2.2. For every edge e ∈ E(G), e is impervious iff both endpoints of e are inaccessible.

Proof. If either endpoint of e is accessible, then e is clearly viable. Assume therefore that both endpoints u_1 and u_2 of e are inaccessible, and let α be an arbitrary external M-alternating walk from v ∈ Ext(G) to either u_1 or u_2, say u_1. By Lemma 2.1, there exists a suitable external M-alternating trail β from v to u_1. Each internal edge lying on β has an accessible endpoint, so that e is not among them. Moreover, the edge of h(β) incident with u_1 must be positive with respect to S(M, α), otherwise h(β) would be a positive external alternating path with respect to state S(M, F). (Recall that h(β) is the handle of β, and take h(β) = β if β is just a path.) But then e must be negative with respect to S(M, α), or else h(β) e u_2 would be a positive external alternating path leading to u_2 (with respect to S(M, F), or even M, since F is disjoint from β). We conclude that the walk α cannot continue on e, because it must take the two positive edges incident with u_1 before and after hitting that vertex. Thus, every time an external alternating walk reaches either endpoint of e, it will miss e as a possible continuation. In other words, e is impervious. □

3 Elementary decomposition of soliton graphs

Again, let us fix a soliton graph G for the entire section. In general, the subgraph of G determined by its allowed edges has several connected components, which are called the elementary components of G. Notice that an elementary component can be as small as a single external vertex of G. Such elementary components are called degenerate, and they are the only exception to the general rule that each elementary component is an elementary graph.
A mandatory elementary component is a single mandatory edge e ∈ E(G), which might have a loop around one or both of its endpoints. The structure of elementary components in a soliton graph G was analysed in [2]. To summarize the main results of this analysis, we first need to review some of the key concepts introduced in that paper.

Elementary components are classified as external or internal, depending on whether or not they contain an external vertex. An elementary component of G is viable if it does not contain impervious allowed edges. A viable internal elementary component C is one-way if all external alternating paths (with respect to any state M) enter C in vertices belonging to the same canonical class of C. This unique class, as well as the vertices belonging to it, are called principal. Furthermore, every external elementary component is considered a priori one-way (with no principal canonical class, of course). A viable elementary component is two-way if it is not one-way. An impervious elementary component is one that is not viable.

Figure 3: Elementary components in a soliton graph.

Example. The graph of Fig. 3 has five elementary components, among which C_1 and D are external, while C_2, C_3 and C_4 are internal. Component C_3 is one-way with the canonical class {u, v} being principal, while C_2 is two-way and C_4 is impervious.

Let C be an elementary component of G, and M be a state. An M-alternating C-ear is a negative M-alternating path or loop having its two endpoints, but no other vertices, in C. The endpoints of the ear will necessarily be in the same canonical class of C. We say that elementary component C' is two-way accessible from component C with respect to any (or all) state(s) M, in notation C ρ C', if C' is covered by an M-alternating C-ear. It is required, though, that if C is one-way and internal, then the endpoints of this ear not be in the principal canonical class of C. As was shown in [2], the two-way accessibility relationship is matching invariant.

A family of elementary components in G is a block of the partition induced by the smallest equivalence relation containing ρ. A family F is called external if it contains an external elementary component; otherwise F is internal. A degenerate family is one that consists of a single degenerate external elementary component. Family F is viable if every elementary component in F is such, and F is impervious if it is not viable. As it turns out easily, the elementary components of an impervious family are all impervious. Soliton graph G is viable if all of its families are such.

Example (continued). Our example graph in Fig. 3 has four families: F_1 = {C_1, C_2}, F_2 = {D}, F_3 = {C_3}, F_4 = {C_4}. Family F_1 is external, F_2 is degenerate, and F_3 is internal. These families are all viable, whereas family F_4 is impervious.

The first group of results obtained in [2] on the structure of elementary components of G can now be stated as follows.

Theorem 3.1. Each viable family of G contains a unique one-way elementary component, called the root of the family. Each vertex in every member of the family, except for the principal vertices of the root, is accessible. The principal vertices themselves are inaccessible, but all other vertices are only accessible through them.

A family F is called near-external if each forbidden viable edge incident with any principal vertex of its root is external.
For two distinct viable families F_1 and F_2, F_2 is said to follow F_1, in notation F_1 → F_2, if there exists an edge in G connecting a non-principal vertex in F_1 with a principal vertex of the root of F_2. The reflexive and transitive closure of → is denoted by →*.

The second group of results in [2] characterizes the edge connections between members inside one viable family, and those between two different families.

Theorem 3.2. The following four statements hold for the families of G.
1. An edge e inside a viable family F is impervious iff both endpoints of e are in the principal canonical class of the root. Every forbidden edge e connecting two different elementary components in F is part of an ear to some member C ∈ F.
2. For every edge e connecting a viable family F_1 to any other family (viable or not) F_2, at least one endpoint of e is principal in F_1 or F_2. If the endpoint of e in F_1 is not principal, then F_2 is viable and it follows F_1.
3. The relation →* is a partial order between viable families, by which the external families are minimal elements.
4. If F and G are families such that F →* G, then each non-principal vertex u of G is accessible from F, meaning that for every state M there exists a positive M-alternating path to u either from a suitable external vertex of F, if F is external, or from an arbitrary principal vertex of F, if F is internal. The path α runs entirely in the families between F and G according to →*.

A chestnut is a graph G = γ ∪ α_1 ∪ ... ∪ α_k, k ≥ 1, where γ is a cycle of even length and each α_i (i ∈ [k]) is a tree, subject to the following conditions:

(i) V(α_i) ∩ V(α_j) = ∅ for 1 ≤ i ≠ j ≤ k;
(ii) V(α_i) ∩ V(γ) consists of a unique vertex, denoted by v_i, for each i ∈ [k];
(iii) v_i and v_j are at even distance on γ for any distinct i, j ∈ [k];
(iv) any vertex w_i ∈ V(α_i) with d(w_i) ≥ 2 is at even distance from v_i in α_i, for each i ∈ [k].

Figure 4: A chestnut.

See Fig. 4 for an example chestnut. Our first observation regarding chestnuts is that they are bipartite graphs. Let us call a vertex of a chestnut G outer if its distance from any of the v_i's is even, and inner if this distance is odd. Then the inner and outer vertices indeed define a bipartition of G. Moreover, the degree of each inner vertex is at most 2. It is easy to come up with a perfect internal matching for G: just mark the cycle γ in an alternating fashion; the continuation is then uniquely determined by the structure of the trees α_i. Thus, G has exactly two states. It is also easy to see that the inner internal vertices are accessible, while the outer ones are inaccessible. Thus, the cycle γ forms an internal elementary component with its outer vertices constituting the principal canonical class of this component. Moreover, γ forms a stand-alone internal family in G. The rest of G's families are all single mandatory edges along the trees α_i, or they are degenerate ones consisting of a single inner external vertex.

Expected Case for Projecting Points

Sergio Cabello, Matt DeVos and Bojan Mohar

For arbitrary point sets P there are configurations with Conc(P) > n/2 for any line L, and so EConc(P) = Ω(n). However, the problem becomes non-trivial if we put restrictions on the density of the point set. Our objective^{1,2} is to bound the value EConc(P) for any 1-separated point set. Kučera et al. [4] have shown that Conc(P) = O(√(n log n)) for any 1-separated point set P. More interestingly, they use Besicovitch sets [1] to construct 1-separated point sets P having Conc(P) = Ω(√(n log n)), which implies EConc(P) = Ω(√(n log n)).
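The quantities involved are easy to explore numerically. The sketch below (ours; the graduation is anchored at the origin, an arbitrary choice that changes Conc(P) by at most a constant factor) estimates EConc(P) for a given point set by sampling directions uniformly from [0, π):

```python
import numpy as np

def conc(points, alpha):
    """Conc of P on the graduated line L(alpha): the maximum number of
    projections that fall into a single unit cell."""
    u = np.array([np.cos(alpha), np.sin(alpha)])
    cells = np.floor(points @ u).astype(int)       # unit cells of the line
    return np.bincount(cells - cells.min()).max()

def expected_conc(points, samples=1000, seed=0):
    """Monte Carlo estimate of EConc(P) over a uniformly random direction."""
    rng = np.random.default_rng(seed)
    return float(np.mean([conc(points, a) for a in rng.uniform(0.0, np.pi, samples)]))

# a 1-separated example: a 20 x 20 piece of the integer grid (n = 400)
grid = np.array([(x, y) for x in range(20) for y in range(20)], dtype=float)
print(expected_conc(grid))
```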
We will show that for any 1-separated point set P we have EConc(P) = O(n^{2/3}). It therefore remains open to find tight bounds for EConc(P).

The rationale behind considering projections onto random lines is the efficiency of randomized algorithms whose running time depends on the expected concentration. As an example, consider a set of disjoint unit disks and any sweep-line algorithm [2, Chapter 2] whose running time depends on the maximum number of disks that are intersected by the sweep line. Choosing the direction in which the line sweeps affects the running time, but computing the best direction, or an approximation of it, is expensive: Kučera et al. [4] claim that it can be done in polynomial time, and Díaz et al. [3] give a constant-factor approximation algorithm with O(nt log nt) running time, where t is the diameter of P. By choosing a random projection we avoid having to compute a good direction for projecting, and we get a randomized algorithm. The results in this paper are helpful for analyzing the expected running time of such randomized algorithms.

The rest of the paper is organized as follows. In Section 2 we introduce some relevant random variables and give some basic facts. In Sections 3 and 4 we bound EConc(P) using the first and second moments, respectively.

^1 Supported by the European Community Sixth Framework Programme under a Marie Curie Intra-European Fellowship.
^2 Supported in part by the Ministry of Education, Science and Sport of Slovenia, Research Program P1-0507-0101.

2 Preliminaries

Let P = {p_1, ..., p_n} be a 1-separated point set, and let d_{i,j} = d(p_i, p_j). We use the notation [n] = {1, ..., n}. Without loss of generality, we can restrict ourselves to graduated lines passing through the origin. Let L(α) be the line passing through the origin that has angle α with the x-axis, and let p^*(α) be the orthogonal projection of a point p onto L(α). Consider the following random variables for the angle α:

X_{i,j}(α) = 1 if d(p_i^*(α), p_j^*(α)) ≤ 1, and X_{i,j}(α) = 0 otherwise;

X_i(α) = Σ_{j=1}^{n} X_{i,j}(α);

X_max(α) = max{X_1(α), ..., X_n(α)};

X(α) = Σ_{i=1}^{n} X_i(α) = Σ_{i=1}^{n} Σ_{j=1}^{n} X_{i,j}(α),

where α is chosen uniformly at random from the interval [0, π). In words: X_{i,j} is the indicator variable for the event that p_i^*(α) and p_j^*(α) are at distance at most one in the projection; X_i is the number of points (including p_i itself) whose projection is at distance at most one from p_i^*(α); X_max is the maximum among X_1, ..., X_n; and X counts twice the number of pairs of points at distance at most one in the projection.

It is clear that P[X_{i,i} = 1] = 1 for any i ∈ [n]. Otherwise we have the following result.

Lemma 1. If i ≠ j, then P[X_{i,j} = 1] = (2/π) arcsin(1/d_{i,j}).

Proof. Assume without loss of generality that p_i is placed at the origin and p_j is vertically above it, on the y-axis. See Figure 1. We may also assume that the line L(α) passes through p_i. Because d_{i,j} ≥ 1, there are values α such that X_{i,j}(α) = 1. The angles that make X_{i,j}(α) = 1 are indicated in the figure. In particular, if β is the angle indicated in the figure, and we choose α uniformly at random, then P[X_{i,j} = 1] = 2β/π. The angle β is such that sin β = 1/d_{i,j}, and so β = arcsin(1/d_{i,j}). We conclude that P[X_{i,j} = 1] = (2/π) arcsin(1/d_{i,j}). □

The first observation, which is already used for the approximation algorithms described by Díaz et al. [3], is that, asymptotically, we do not need to care about the graduation, but only about the orientation of the line. In particular, the random variables X_i contain all the information that we need asymptotically.

Lemma 2. We have EConc(P)/2 ≤ E[X_max(α)] ≤ 2 EConc(P).
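Lemma 1 can be sanity-checked by simulation. The following fragment (ours) places p_i at the origin and p_j at distance d on the y-axis, exactly as in the proof, and compares the empirical frequency with (2/π) arcsin(1/d):

```python
import numpy as np

def empirical_p(d, samples=200_000, seed=0):
    """Empirical P[X_ij = 1] for two points at distance d >= 1: with p_i at
    the origin and p_j at (0, d), the projected distance onto L(alpha) is
    d * |sin(alpha)|."""
    alpha = np.random.default_rng(seed).uniform(0.0, np.pi, samples)
    return np.mean(d * np.abs(np.sin(alpha)) <= 1.0)

d = 3.0
print(empirical_p(d), 2.0 / np.pi * np.arcsin(1.0 / d))  # the two values agree
```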
3 Using the first moment

Using that the closest pair of P is at least one apart, we get the following result.

Lemma 3. For every i ∈ [n], we have Σ_{j ∈ [n]\{i}} 1/d_{i,j} = O(√n).

Proof. Without loss of generality, assume that i = n. Let n_d be the number of points in P whose distance from p_n is in the interval [d, d+1). We have

Σ_{j ∈ [n−1]} 1/d_{n,j} = Σ_{d=1}^{∞} ( Σ_{j : d_{n,j} ∈ [d,d+1)} 1/d_{n,j} ) ≤ Σ_{d=1}^{∞} n_d/d.

Observe that if we have two sequences (a_i)_{i∈N} and (b_i)_{i∈N} of nonnegative numbers such that Σ_{i=1}^{t} a_i ≤ Σ_{i=1}^{t} b_i for all t, then Σ_i a_i/i ≤ Σ_i b_i/i. In our case Σ_{d=1}^{t} n_d = O(t²), because a disc of radius t+1 contains O(t²) points of a 1-separated set, and also Σ_d n_d ≤ n. The sum Σ_d n_d/d is therefore largest when n_d = Θ(d) for d = O(√n) and n_d = 0 otherwise, which gives Σ_{d=1}^{∞} n_d/d = O(√n). □

Lemma 4. For every i ∈ [n], we have E[X_i] = O(√n).

Proof. Using Lemma 1 and the inequality arcsin x ≤ (π/2) x for x ∈ [0, 1], we get

E[X_i] = Σ_{j ∈ [n]} P[X_{i,j} = 1] = 1 + Σ_{j ∈ [n]\{i}} (2/π) arcsin(1/d_{i,j}) ≤ 1 + Σ_{j ∈ [n]\{i}} 1/d_{i,j},

and using Lemma 3 we conclude that E[X_i] = O(√n). □

Using the first moment method, we can show that for any 1-separated point set P it holds that EConc(P) = O(n^{3/4}). For this, consider a 1-separated point set P and its associated random variable X. We have X = Σ_{i ∈ [n]} X_i, and because of Lemma 4 we conclude E[X] = O(n√n).

We claim that, for any value t > 0, if we have X_max(α) ≥ t, then X(α) ≥ t²/4. Intuitively, if some X_i = t, then there are Ω(t²) pairs of points at distance at most one from each other in the projection, and so contributing to X. The formal proof of the claim is as follows. Let i be an index such that X_i(α) ≥ t. Then, either to the right or to the left of p_i^*(α), the projection of p_i onto L(α), there are at least t/2 points p_j^*(α) at distance at most one from p_i^*(α). Assume that those points are to the left, and let P' ⊆ P be the set of those points. We have |P'| ≥ t/2. For any p_j, p_j' ∈ P' we have X_{j,j'}(α) = 1, and therefore X_j(α) ≥ t/2 for all p_j ∈ P'. We conclude that

X(α) ≥ Σ_{p_j ∈ P'} X_j(α) ≥ Σ_{p_j ∈ P'} t/2 ≥ (t/2) · |P'| ≥ t²/4,

and the claim is proved. We have shown that for any value t > 0 we have [X_max ≥ t] ⊆ [X ≥ t²/4], and using Markov's inequality we conclude

P[X_max ≥ t] ≤ P[X ≥ t²/4] ≤ 4 E[X]/t² = O(n√n)/t².

Let r = ⌈n^{3/4}⌉. Since X_max only takes natural values, we have

E[X_max] = Σ_{t=1}^{∞} P[X_max ≥ t] = Σ_{t=1}^{r} P[X_max ≥ t] + Σ_{t=r+1}^{∞} P[X_max ≥ t] ≤ r + Σ_{t=r+1}^{∞} O(n√n)/t² ≤ n^{3/4} + O(n√n)/r = O(n^{3/4}).

Using Lemma 2 it follows that EConc(P) = O(n^{3/4}). However, observe that this bound will be improved in the next section.

We would like to point out that the random variables X_i do not have a strong concentration around their expectation. Therefore, we cannot use many of the results based on concentration of measure that would reduce the bound on EConc(P). To see this, consider the example in Figure 2. The point p_i is the center of a disc of radius n^{3/4}, and we consider a circular sector with arc-length n^{1/4}; this region is grey in the picture. Imagine that we place a densest 1-separated point set P inside the grey region. Asymptotically, since the region has area Θ(n), such a point set P has Θ(n) points. Consider the lines L(α + π/2) passing through p_i. If α is chosen uniformly at random, the line L(α) intersects the grey region with probability n^{1/4}/(2π n^{3/4}) = Θ(1/√n), and in that case X_i(α + π/2) = Θ(n^{3/4}). We conclude that E[X_i] = Θ(n^{1/4}), but P[X_i = Ω(n^{3/4})] = Ω(1/√n), and so X_i does not concentrate around its expectation.

Figure 2: Example showing that X_i is not concentrated around its expectation.

4 Second moments

Lemma 5. For every i ∈ [n] we have E[X_i²] = O(n).

Proof. Assume without loss of generality that d_{i,j} ≥ d_{i,k} whenever j ≥ k; that is, the points are indexed according to their distance from p_i. As above, we assume that the line L(α) passes through p_i. We have E[X_i²] = E[(Σ_{k ∈ [n]} X_{i,k})²]. By the way we indexed the points, if X_{i,k}(α) = 1, then every point p_j with j ≤ k and X_{i,j}(α) = 1 lies in a rectangle of width 2 and height 2 d_{i,k} around p_i (see Figure 3); since these points are 1-separated, we conclude that Σ_{j ≤ k} X_{i,j}(α) = O(d_{i,k}). For any angle α we therefore have

X_i(α)² = ( Σ_{k ∈ [n]} X_{i,k}(α) )² ≤ 2 Σ_{k ∈ [n]} X_{i,k}(α) ( Σ_{j ≤ k} X_{i,j}(α) ) ≤ Σ_{k ∈ [n]} X_{i,k}(α) · O(d_{i,k}).
Taking expectations and using Lemma 1, we conclude that

E[X_i²] ≤ Σ_{k ∈ [n]} O(d_{i,k}) · P[X_{i,k} = 1] = Σ_{k ∈ [n]} O(d_{i,k}) · O(1/d_{i,k}) = O(n). □

Figure 3: For the proof of Lemma 5.

Consider now the random variable T(α) = Σ_{i ∈ [n]} X_i(α)². Because of Lemma 5 we have E[T] = O(n²). We claim that, for any value t > 0, if we have X_max(α) ≥ t, then T(α) ≥ t³/8. The proof is as follows. Let i be an index such that X_i(α) ≥ t. Then, either to the right or to the left of p_i^*(α), the projection of p_i onto L(α), there are at least t/2 points p_j^*(α) at distance at most one from p_i^*(α). Assume that those points are to the left, and let P' ⊆ P be the set of those points. We have |P'| ≥ t/2. For any p_j, p_j' ∈ P' we have X_{j,j'}(α) = 1. Therefore X_j(α) ≥ t/2 for all p_j ∈ P', and X_j(α)² ≥ t²/4. We conclude that

T(α) ≥ Σ_{p_j ∈ P'} X_j(α)² ≥ Σ_{p_j ∈ P'} t²/4 ≥ (t²/4) · |P'| ≥ t³/8,

and the claim is proved. We have shown that for any value t > 0 we have [X_max ≥ t] ⊆ [T ≥ t³/8], and using Markov's inequality we conclude

P[X_max ≥ t] ≤ P[T ≥ t³/8] ≤ 8 E[T]/t³ = O(n²)/t³.

Let r = ⌈n^{2/3}⌉. Since X_max only takes natural values, we have

E[X_max] = Σ_{t=1}^{∞} P[X_max ≥ t] = Σ_{t=1}^{r} P[X_max ≥ t] + Σ_{t=r+1}^{∞} P[X_max ≥ t] ≤ r + Σ_{t=r+1}^{∞} O(n²)/t³ ≤ r + O(n²) ∫_{r}^{∞} t^{−3} dt ≤ n^{2/3} + O(n²/r²) = O(n^{2/3}).

Using Lemma 2 it follows that EConc(P) = O(n^{2/3}). □

Trying to use the same ideas with higher moments of X_i does not help. Consider for example the 1-separated point set P consisting of n points in a horizontal row of length n, and let p_1 be the leftmost point. We have E[X_1³] = O(n²), and in general E[X_1^p] = O(n^{p−1}) for all naturals p ≥ 2. From this we can only conclude weaker results of the type EConc(P) = O(n^{p/(p+1)}).

Conclusions

We have studied the expected concentration of projecting 1-separated point sets onto random lines, a parameter that is relevant for sweep-line algorithms when the direction for sweeping is chosen at random. We have shown that, if P consists of n points, the expected concentration EConc(P) is O(n^{2/3}), while the best known lower bound is Ω(√(n log n)). It remains to close this gap.

Acknowledgements

The authors are grateful to Jiří Matoušek for the key reference [4]. Sergio is also grateful to Christian Knauer for early discussions.

References

[1] A. S. Besicovitch. The Kakeya problem. The American Mathematical Monthly, 70:697-706, 1963.
[2] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin, Germany, 2nd edition, 2000.
[3] J. M. Díaz, F. Hurtado, M. López, and J. A. Sellarès. Optimal point set projections onto regular grids. In T. Ibaraki et al., editor, 14th Inter. Symp. on Algorithms and Computation, volume 2906 of LNCS, pages 270-279. Springer-Verlag, 2003.
[4] L. Kučera, K. Mehlhorn, B. Preis, and E. Schwarzenecker. Exact algorithms for a geometric packing problem. In Proc. 10th Sympos. Theoret. Aspects Comput. Sci., volume 665 of Lecture Notes Comput. Sci., pages 317-322. Springer-Verlag, 1993.

The Spectra of Knödel Graphs

Hovhannes A. Harutyunyan and Calin D. Morosan
Department of Computer Science, Concordia University, 1455 de Maisonneuve Blvd. West, H3G 1M8, Montreal, QC, Canada
E-mail: haruty@cs.concordia.ca, cd_moros@cs.concordia.ca

Keywords: Knödel graphs, spectra of graphs, number of spanning trees

Received: January 28, 2005

Knödel graphs W_{d,n} are regular graphs on n vertices of degree d. They were introduced by W. Knödel and have been proved to be minimum gossip graphs and minimum broadcast graphs for d = ⌊log₂ n⌋. They became even more interesting in the light of recent results regarding the diameter, which is, up to now, the smallest known diameter among all minimum broadcast graphs on 2^d vertices.
Also, the logarithmic-time routing algorithm that we have found, the bipancyclicity property, embedding properties and, not least, the Cayley graph structure make these graphs good candidates for regular network constructions, especially in supercomputing. In this paper we describe an efficient way to compute the spectra of Knödel graphs using results from Fourier analysis, circulant matrices and PD-matrices. Based on this result we give a formula for the number of spanning trees of Knödel graphs.

Povzetek: An analysis of Knödel graphs is presented.

1 Introduction

Knödel graphs W_{d,n} are regular graphs on an even number of vertices n and of degree d. They were introduced by W. Knödel [10] and have been proved to be minimum gossip graphs and minimum broadcast graphs for degree d = ⌊log₂ n⌋. Recently, it has been proved in [7] that the Knödel graph W_{d,2^d} on 2^d vertices and degree d has diameter ⌈d/2⌉ + 1, which is the minimum known diameter among all minimum broadcast graphs on 2^d vertices. We believe that this is also a lower bound on the diameter of all regular graphs on 2^d vertices and degree d. Also, the logarithmic-time routing algorithm that we have found [9], the bipancyclicity property, embedding properties and, not least, the Cayley graph structure [6] make these graphs good candidates for regular network constructions, especially in supercomputing.

The goal of this study is to compute efficiently the spectra of Knödel graphs, first for W_{d,2^d}, and then for arbitrary degree g and number of vertices n. We use results from Fourier analysis, circulant matrices and PD-matrices. Based on this result we give a formula for the number of spanning trees of Knödel graphs.

The paper is organized as follows: section 2 gives some definitions, section 3 extracts the general properties of the spectra, section 4 explains the method of computation, section 5 makes some remarks regarding the obtained spectra, and section 6 establishes the number of spanning trees.

2 Definitions and notations

If we denote by A the adjacency matrix of a simple graph G, the set of eigenvalues of A, together with their multiplicities, is said to be the spectrum of G. If we denote by I the identity matrix, then the characteristic polynomial of G is defined as P(λ) = det|λI − A|. The spectrum of G will be the set of solutions of the equation P(λ) = 0. Knowing the spectrum of a graph has a great impact on other characteristics of the graph. For example, the complexity (number of spanning trees) of a regular graph is

κ(G) = (1/n) ∏_{k=1}^{n−1} (λ_n − λ_k),

where n is the number of eigenvalues and λ_n is the greatest eigenvalue. Up to now, the spectra are known for some particular graphs: path, cycle, complete graph, complete bipartite graph, complete tree, hypercube, k-dimensional lattice, star graph, etc. (see [4] and [8] for further references). In particular, for W_{d,2^d}, the number of distinct eigenvalues is at least ⌈d/2⌉ + 2, since the diameter is ⌈d/2⌉ + 1 [4].

The Knödel graphs W_{g,n} are defined as graphs G(V, E) with |V| = n even, and the set of edges [6]:

E = {(i, j) | i + j ≡ 2^k − 1 (mod n)},   (1)

where k = 1, 2, ..., g, 0 ≤ i, j ≤ n − 1, and 1 ≤ g ≤ ⌊log₂ n⌋.

We denote the adjacency matrix of an undirected graph by A = [a_{ij}], where 1 ≤ i, j ≤ |V| = n, a_{ij} = 1 whenever vertex i is adjacent to vertex j, and 0 otherwise. If M is a matrix, we denote by M^T the transpose of M, by M̄ the complex conjugate of M, by M* the transpose complex conjugate of M, and by M^{−1} the inverse of M. We denote by π a permutation

π = ( 1    2    ...  n    )
    ( π(1) π(2) ...  π(n) )   (2)
and by P(π) = (a_{ij}) the corresponding permutation matrix of π, where a_{i,π(i)} = 1 and a_{i,j} = 0 for j ≠ π(i). If z ∈ C, we denote by z̄ the complex conjugate of z, and by |z| = √(z z̄) the norm of z. We denote by diag(λ_1, λ_2, ..., λ_n) the diagonal matrix with main diagonal (λ_1, λ_2, ..., λ_n). We denote by circ(a_1, a_2, ..., a_n) a circulant matrix with first row (a_1, a_2, ..., a_n); the remaining rows are successive circular shifts of the first row toward the right, so that a_{i,j} = a_{1, j−i+1 mod n}. If the step of the shift is an integer q ≠ 1, we call this matrix a (q)circulant matrix [12]. We denote by Γ the inverse permutation matrix, which is a (−1)circulant: Γ = (−1)circ(1, 0, ..., 0). An important property of Γ is that Γ² = I, where I is the identity matrix. We denote by F the Fourier matrix, defined by its conjugate transpose F* = (1/√n)[w^{(i−1)(j−1)}], 1 ≤ i, j ≤ n, where w stands for the nth root of unity [5]. Two important properties of F are: F^T = F and F F* = I. Other definitions and notations will follow in the places where they are used.

3 General graph theory considerations

We observe that the adjacency matrix of a Knödel graph is a (−1)circulant matrix, also called a retrocirculant [1], in which all rows are circular permutations of the first row toward the left. For example, the adjacency matrix of W_{3,2^3} is:

A_{W_{3,2^3}} =
( 0 1 0 1 0 0 0 1 )
( 1 0 1 0 0 0 1 0 )
( 0 1 0 0 0 1 0 1 )
( 1 0 0 0 1 0 1 0 )
( 0 0 0 1 0 1 0 1 )
( 0 0 1 0 1 0 1 0 )
( 0 1 0 1 0 1 0 0 )
( 1 0 1 0 1 0 0 0 )   (3)

Some general remarks can be made about the spectra of W_{g,n}. All eigenvalues are real, since the adjacency matrix is real and symmetric [3]. The maximum eigenvalue is λ_n = g, since W_{g,n} is regular of degree g [2]. All eigenvalues are symmetric with respect to zero [11], since the Knödel graph is bipartite and its characteristic polynomial has the form:

P(λ) = λ^n + a_2 λ^{n−2} + ... + a_{n−2} λ^2 + a_n.   (4)
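Definition (1) and the (−1)circulant structure are easy to verify mechanically. The sketch below (ours, not the authors' code) rebuilds the adjacency matrix of W_{3,2^3} shown in (3) and checks both the retrocirculant property, in 0-based indexing a_{i,j} = a_{0,(i+j) mod n}, and the symmetry of the spectrum:

```python
import numpy as np

def knodel_adjacency(g, n):
    """Adjacency matrix of W_{g,n} from definition (1):
    vertices 0..n-1, with i ~ j iff i + j = 2^k - 1 (mod n), k = 1..g."""
    A = np.zeros((n, n), dtype=int)
    targets = {(2 ** k - 1) % n for k in range(1, g + 1)}
    for i in range(n):
        for j in range(n):
            if (i + j) % n in targets:
                A[i, j] = 1
    return A

A = knodel_adjacency(3, 8)            # W_{3,2^3}, as in (3)
assert (A == A.T).all()               # undirected, hence real spectrum
assert all(A[i, j] == A[0, (i + j) % 8] for i in range(8) for j in range(8))
eig = np.sort(np.linalg.eigvalsh(A))
print(eig)                            # symmetric around 0, maximum = degree 3
```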
4 Computing the spectrum of W_{d,2^d}

According to [5], a matrix A is (−1)circulant if and only if A = F*(ΓΛ)F, where Λ = diag(γ_1, γ_2, ..., γ_n). This relation can be transformed into F A F* = ΓΛ, which means that A and ΓΛ have the same eigenvalue set [5]. The second term is a PD-matrix, defined as a product of two matrices P and D, where P is a permutation matrix and D is a diagonal matrix. The characteristic polynomial of a PD-matrix can be computed by decomposing the permutation P into prime cycles of total length n [5]. Since the adjacency matrices of Knödel graphs are (−1)circulants, the problem reduces to determining the values γ_1, γ_2, ..., γ_n. Since ΓΛ has the form

ΓΛ =
( γ_1 0   ... 0   )
( 0   0   ... γ_n )
( ............... )
( 0   γ_2 ... 0   )   (5)

we can perform F A F* = [c_{ij}] = ΓΛ and identify the terms c_{1,1} = γ_1, c_{2,n} = γ_n, ..., c_{n,2} = γ_2. In order to perform the triple matrix multiplication F A F*, we note that:

F = F̄* = (1/√n) [w^{−(i−1)(j−1) mod n}].   (6)

Since w^n = 1, we may skip the modulo operations in the exponents. Also, in order to avoid confusion with the summation indices, we emphasize the matrix indices: [a]_{i,j} means that i is the row index and j is the column index, both varying from 1 to n. Then

F A F* = (1/n) [w^{−(i−1)(k−1)}]_{i,k} [a_{k,m}]_{k,m} [w^{(m−1)(j−1)}]_{m,j}
       = (1/n) [ Σ_{m=1}^{n} ( Σ_{k=1}^{n} w^{−(i−1)(k−1)} a_{1,m+k−1} ) w^{(m−1)(j−1)} ]_{i,j}.   (7)

Since in the first row of the adjacency matrix only d entries are nonzero, namely those at positions m ≡ 2^r (mod n), we can change the variable of summation in the inner sum of (7): k → r, where 1 ≤ r ≤ d. Therefore:

F A F* = (1/n) [ Σ_{m=1}^{n} Σ_{r=1}^{d} w^{−(i−1)(2^r−m)} w^{(m−1)(j−1)} ]_{i,j}
       = (1/n) [ w^{−(j−1)} Σ_{m=1}^{n} w^{m(i+j−2)} Σ_{r=1}^{d} w^{−2^r(i−1)} ]_{i,j}.   (8)

Thus, for the general term of F A F* we obtain:

c_{i,j} = (1/n) w^{−(j−1)} Σ_{m=1}^{n} w^{m(i+j−2)} Σ_{r=1}^{d} w^{−2^r(i−1)}.   (9)

The general term of the ΓΛ matrix from (5) can then be expressed as follows:

γ_p = c_{n−p+2,p} = (1/n) w^{−(p−1)} ( Σ_{m=1}^{n} w^{mn} ) Σ_{r=1}^{d} w^{−2^r(n−(p−1))}.   (10)

But Σ_{m=1}^{n} w^{mn} = n and w^{−2^r(n−(p−1))} = w^{2^r(p−1)}. Thus,

γ_p = w^{−(p−1)} Σ_{r=1}^{d} w^{2^r(p−1)}.   (11)

On the other hand, the matrix Γ corresponds to the permutation

π = ( 1 2 3   ... n/2+1 ... n )
    ( 1 n n−1 ... n/2+1 ... 2 )

This permutation can be decomposed into n/2 + 1 prime cycles of total length n [5, 1]: (1)(2, n)(3, n−1) ... (n/2, n/2+2)(n/2+1). Thus, the characteristic polynomial will be:

P(λ) = (λ − γ_1)(λ² − γ_2 γ_n)(λ² − γ_3 γ_{n−1}) ... (λ² − γ_{n/2} γ_{n/2+2})(λ − γ_{n/2+1}),   (12)

and the eigenvalue set will be:

S = {γ_1, γ_{n/2+1}, ±√(γ_2 γ_n), ±√(γ_3 γ_{n−1}), ..., ±√(γ_{n/2} γ_{n/2+2})}.   (13)

For the first eigenvalue we obtain:

γ_1 = Σ_{r=1}^{d} 1 = d.   (14)

Aitken proved in [1] that, for a (−1)circulant, γ_{n/2+1} = a_1 − a_2 + a_3 − ... − a_n, where (a_1, a_2, ..., a_n) is the first row of the adjacency matrix. Since the nonzero entries of the first row sit at the even positions 2^r, we get:

γ_{n/2+1} = Σ_{i=1}^{n} (−1)^{i+1} a_i = Σ_{r=1}^{d} (−1)^{2^r+1} = −d.   (15)

For the rest of the eigenvalues we have to evaluate products of the form γ_t γ_{n−t+2}, 2 ≤ t ≤ n/2. From (11) we have:

γ_t γ_{n−t+2} = ( w^{−(t−1)} Σ_{r=1}^{d} w^{2^r(t−1)} ) ( w^{−(n−t+1)} Σ_{r=1}^{d} w^{2^r(n−t+1)} )
             = ( Σ_{r=1}^{d} w^{2^r(t−1)} ) ( Σ_{r=1}^{d} w^{−2^r(t−1)} )
             = | Σ_{r=1}^{d} w^{2^r(t−1)} |²,   (16)

because w^{−(t−1)} w^{−(n−t+1)} = w^{−n} = 1 and the two sums are complex conjugates of each other. This confirms the well-known fact that all eigenvalues are real. Thus, the spectrum of W_{d,2^d} is the set:

S(W_{d,2^d}) = {±d} ∪ { ± | Σ_{r=1}^{d} w^{2^r(t−1)} | },   (17)

where 2 ≤ t ≤ n/2.

5 Observations

A. Not all eigenvalues are distinct. We can show that at most (n − 4)/2 of them are distinct. If we decompose the norm from (17) into its trigonometric form, we obtain:

| Σ_{r=1}^{d} w^{2^r(t−1)} |² = ( Σ_{r=1}^{d} cos((2π/2^d) 2^r (t−1)) )² + ( Σ_{r=1}^{d} sin((2π/2^d) 2^r (t−1)) )².   (18)

We observe that this norm evaluates to the same value for t = n/4 + 1 − k and t = n/4 + 1 + k: since w^{2^r · n/4} equals −1 for r = 1 and 1 for r ≥ 2, the sums Σ_{r=1}^{d} w^{2^r(n/4−k)} and Σ_{r=1}^{d} w^{2^r(n/4+k)} are complex conjugates of each other and therefore have the same norm. The computations for particular cases lead to the claim that these are the only overlapping eigenvalues.

B. To our knowledge, there is no closed form for the sum from (16). Nevertheless, for the particular value t = 2^d/4 + 1 the sum does evaluate to a closed form: in this case w^{2^r(t−1)} = w^{2^{r+d−2}} equals −1 for r = 1 and 1 for r ≥ 2, so

| Σ_{r=1}^{d} w^{2^r(t−1)} |² = (d − 2)².   (20)

Thus, for W_{d,2^d}, the spectrum from (17) can be written as:

S(W_{d,2^d}) = {±d, ±(d−2)} ∪ { ± | Σ_{r=1}^{d} w^{2^r(t−1)} | },   (21)

where 2 ≤ t ≤ n/4 and the last set has multiplicity two.

C. Note that in the formulas (7)-(16) we did not make any assumption regarding the number of vertices n or the degree d. Therefore, the result from (17) can be extended in a similar manner to Knödel graphs W_{g,n} with arbitrary degree g and number of vertices n:

S(W_{g,n}) = {±g} ∪ { ± | Σ_{r=1}^{g} w^{2^r(t−1)} | },   (22)

where 2 ≤ t ≤ n/2. For example, for W_{2,2^k}, which are the cycles C_{2^k} of length 2^k, applying (22) we obtain the spectrum:

S(C_{2^k}) = {±2} ∪ { ± | w^{2(t−1)} + w^{4(t−1)} | },   (23)

where 2 ≤ t ≤ 2^{k−1}. The norm from (23) can be evaluated to 2 cos(2π(t−1)/2^k) as follows:

| w^{2(t−1)} + w^{4(t−1)} |² = ( cos((2π/2^k) 2(t−1)) + cos((2π/2^k) 4(t−1)) )² + ( sin((2π/2^k) 2(t−1)) + sin((2π/2^k) 4(t−1)) )² = ( 2 cos((2π/2^k)(t−1)) )².   (24)

Thus, we meet the well-known result [2] that the spectrum of a cycle of length n is the set:

S(C_n) = { 2 cos(2πj/n) | 1 ≤ j ≤ n }.   (25)
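Formula (22) can be cross-checked against a direct numerical diagonalization. The following fragment (ours, reusing knodel_adjacency from the earlier sketch) confirms the closed form for W_{3,2^3}:

```python
import numpy as np

def knodel_spectrum_formula(g, n):
    """Spectrum of W_{g,n} from (22): {+g, -g} together with the pairs
    +-|sum_{r=1}^{g} w^{2^r (t-1)}| for 2 <= t <= n/2, w = exp(2*pi*i/n)."""
    w = np.exp(2j * np.pi / n)
    norms = [abs(sum(w ** (2 ** r * (t - 1)) for r in range(1, g + 1)))
             for t in range(2, n // 2 + 1)]
    return np.sort(np.array([g, -g] + norms + [-x for x in norms]))

g, n = 3, 2 ** 3
formula = knodel_spectrum_formula(g, n)
numeric = np.sort(np.linalg.eigvalsh(knodel_adjacency(g, n)))
print(np.allclose(formula, numeric))   # True
```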
6 The number of spanning trees

An immediate consequence of the spectra of the Knödel graphs W_{g,n} is an O(ng²) formula for the number of spanning trees. It is well known that, given a regular graph G on n vertices of degree k, the number of spanning trees can be expressed as:

κ(G) = (1/n) ∏_{t=1}^{p−1} (k − λ_t)^{m_t},   (26)

where the λ_t are the eigenvalues different from k, m_t are their multiplicities, and p is the number of distinct eigenvalues [2].

Thus, for the particular case in which the degree is d and the number of vertices is 2^d, using (21) we obtain:

κ(W_{d,2^d}) = ( d(2d−2)/2^{d−2} ) ∏_{t=2}^{2^{d−2}} ( d² − | Σ_{r=1}^{d} w^{2^r(t−1)} |² )².   (27)

If we further decompose the norm from (27) into its trigonometric form, we obtain:

| Σ_{r=1}^{d} w^{2^r(t−1)} |² = ( Σ_{r=1}^{d} cos((2π/2^d) 2^r (t−1)) )² + ( Σ_{r=1}^{d} sin((2π/2^d) 2^r (t−1)) )² = d + 2 Σ_{i=1}^{d} Σ_{j=i+1}^{d} cos((2π/2^d)(2^i − 2^j)(t−1)).   (28)

Substituting this result in (27) and changing the variable t → t + 1, we obtain for the number of spanning trees of W_{d,2^d}:

κ(W_{d,2^d}) = ( 2d(d−1)/2^{d−2} ) ∏_{t=1}^{2^{d−2}−1} ( d² − d − φ(t) )²,   (29)

where:

φ(t) = 2 Σ_{i=1}^{d} Σ_{j=i+1}^{d} cos((2π/2^d)(2^i − 2^j) t).   (30)

In general, for Knödel graphs W_{g,n} with arbitrary degree g and arbitrary number of vertices n, according to (22) the number of spanning trees can be expressed as follows:

κ(W_{g,n}) = (2g/n) ∏_{t=2}^{n/2} ( g² − | Σ_{r=1}^{g} w^{2^r(t−1)} |² ).   (31)

A straightforward upper bound for the number of spanning trees of Knödel graphs W_{g,n} can be obtained by cancelling the norm from (31), i.e., by setting

Σ_{r=1}^{g} w^{2^r(t−1)} = 0.   (32)

Therefore, for κ(W_{g,n}) we obtain the upper bound:

κ(W_{g,n}) ≤ (2/n) g^{n−1}.   (33)

Since, for Knödel graphs W_{g,n}, the degree g of a vertex is upper bounded by ⌊log₂ n⌋ (see (1)), the bound from (33) can be expressed as follows:

κ(W_{g,n}) ≤ (2/n) ⌊log₂ n⌋^{n−1}.   (34)
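Formula (31) can likewise be validated against the matrix-tree theorem. The sketch below (ours, again reusing knodel_adjacency; the round calls merely clean up floating-point noise) computes the number of spanning trees of W_{3,2^3} both ways:

```python
import numpy as np

def spanning_trees_kirchhoff(A):
    """Matrix-tree theorem: any cofactor of the Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return round(np.linalg.det(L[1:, 1:]))

def spanning_trees_formula(g, n):
    """Number of spanning trees of W_{g,n} according to (31)."""
    w = np.exp(2j * np.pi / n)
    product = 1.0
    for t in range(2, n // 2 + 1):
        norm2 = abs(sum(w ** (2 ** r * (t - 1)) for r in range(1, g + 1))) ** 2
        product *= g * g - norm2
    return round(2 * g / n * product)

g, n = 3, 2 ** 3
print(spanning_trees_formula(g, n),                      # 384 for W_{3,8}
      spanning_trees_kirchhoff(knodel_adjacency(g, n)))  # the two agree
```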
References

[1] A. C. Aitken. Two notes on matrices. Proc. Glasgow Math. Assoc., pages 109-113, 1961-1962.
[2] N. L. Biggs. Algebraic Graph Theory. Cambridge University Press, 1974.
[3] F. Chatelin and M. Ahués. Eigenvalues of Matrices. Wiley, Chichester, 1993.
[4] D. M. Cvetković, M. Doob, and H. Sachs. Spectra of Graphs. Academic Press, New York, 1980.
[5] P. J. Davis. Circulant Matrices. Wiley, New York, 1979.
[6] G. Fertin and A. Raspaud. A survey on Knödel graphs. Discrete Applied Mathematics, 137(2):173-195, 2004.
[7] G. Fertin, A. Raspaud, O. Sýkora, H. Schröder, and I. Vrťo. Diameter of the Knödel graph. In 26th International Workshop on Graph-Theoretic Concepts in Computer Science (WG 2000), volume 1928 of Lecture Notes in Computer Science, pages 149-160. Springer-Verlag, 2000.
[8] C. Godsil and G. Royle. Algebraic Graph Theory. Springer-Verlag, New York, 2001.
[9] H. A. Harutyunyan and C. D. Morosan. On the minimum path problem in Knödel graphs. In Proceedings of the Second International Network Optimization Conference (INOC 2005), Lisbon, Portugal, pages 43-48, 2005.
[10] W. Knödel. New gossips and telephones. Discrete Mathematics, 13:95, 1975.
[11] H. Sachs. Beziehungen zwischen den in einem Graphen enthaltenen Kreisen und seinem charakteristischen Polynom. Publ. Math. Debrecen, 11:119-137, 1964.
[12] K. Wang. On the generalizations of circulants. Linear Algebra and its Applications, 25:197-218, 1979.

On the Crossing Number of Almost Planar Graphs

Bojan Mohar
Faculty of Mathematics and Physics, Department of Mathematics, University of Ljubljana, Jadranska 19, 1000 Ljubljana, Slovenia
E-mail: bojan.mohar@uni-lj.si

Keywords: planar graph, crossing number

Received: May 8, 2005

If G is a plane graph and x, y ∈ V(G), then the dual distance of x and y is equal to the minimum number of crossings of G with a closed curve in the plane joining x and y. Riskin [7] proved that if G_0 is a 3-connected cubic planar graph, and x, y are its vertices at dual distance d, then the crossing number of the graph G_0 + xy is equal to d. Riskin asked if his result holds for arbitrary 3-connected planar graphs. In this paper it is proved that this is not the case (not even for every 5-connected planar graph G_0).

Povzetek: Riskin's claim about planar graphs is analysed.

1 Introduction

Crossing number minimization is one of the fundamental optimization problems in the sense that it is related to various other widely used notions. Besides its mathematical interest, there are numerous applications, most notably those in VLSI design [1, 2, 3] and in combinatorial geometry [9]. We refer to [4, 8] and to [10] for more details about such applications.

A drawing of a graph G is a representation of G in the Euclidean plane R², where vertices are represented as distinct points and edges by simple polygonal arcs joining the points that correspond to their endvertices. A drawing is clean if the interior of every arc representing an edge contains no points representing the vertices of G. If interiors of two arcs intersect, or if an arc contains a vertex of G in its interior, we speak about crossings of the drawing. More precisely, a crossing of a drawing D is a pair ({e, f}, p), where e and f are distinct edges and p ∈ R² is a point that belongs to the interiors of both arcs representing e and f in D. If the drawing is not clean, then the arc of an edge e may contain in its interior a point p ∈ R² that represents a vertex v of G. In such a case, the pair ({v, e}, p) is also referred to as a crossing of D. The number of crossings of D is denoted by cr(D) and is called the crossing number of the drawing D. The crossing number cr(G) of a graph G is the minimum cr(D) taken over all clean drawings D of G. A clean drawing D with cr(D) = 0 is also called an embedding of G. By a plane graph we refer to a planar graph together with an embedding in the Euclidean plane; we shall identify a plane graph with its image in the plane.

A nonplanar graph G is almost planar if it contains an edge e such that G − e is planar. Such an edge e is called a planarizing edge. It is easy to see that almost planar graphs can have arbitrarily large crossing number. In the sequel, we will consider almost planar graphs with a fixed planarizing edge e = xy, and will denote by G_0 = G − e the corresponding planar subgraph.

Let G_0 be a plane graph and let x, y be two of its vertices. A simple (polygonal) arc γ : [0, 1] → R² is an (x, y)-arc if γ(0) = x and γ(1) = y. If γ(t) is not a vertex of G_0 for every t, 0 < t < 1, then the crossings of γ with G_0 are the intersections of γ with the edges of G_0, and the dual distance d*(x, y) is the minimum number of such crossings taken over all (x, y)-arcs. Riskin [7] proved that if G_0 is a 3-connected cubic planar graph, then cr(G_0 + xy) = d*(x, y), and asked whether the same holds for arbitrary 3-connected planar graphs. The theorem below shows that the answer is negative, even for 5-connected planar graphs.
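The dual distance has a convenient combinatorial reading: it is a shortest path in the face-adjacency (dual) graph of the plane graph G_0, between a face incident with x and a face incident with y. The following sketch (ours; it relies on the combinatorial embedding produced by networkx, which is immaterial for 3-connected graphs, since their embedding is essentially unique by Whitney's theorem) computes d*(x, y) this way:

```python
import networkx as nx

def dual_distance(G, x, y):
    """d*(x, y) for a plane graph G: the minimum number of edges of G that a
    clean (x, y)-arc must cross, i.e. the shortest path in the dual graph
    from a face incident with x to a face incident with y."""
    ok, emb = nx.check_planarity(G)
    assert ok, "G must be planar"
    faces, marked = [], set()
    for u, v in emb.edges():                       # half-edges of the embedding
        if (u, v) not in marked:
            faces.append(tuple(emb.traverse_face(u, v, mark_half_edges=marked)))
    D = nx.Graph()
    D.add_nodes_from(range(len(faces)))
    edge_faces = {}
    for idx, face in enumerate(faces):
        for a, b in zip(face, face[1:] + face[:1]):
            edge_faces.setdefault(frozenset((a, b)), []).append(idx)
    for pair in edge_faces.values():               # faces sharing an edge
        if len(pair) == 2:
            D.add_edge(pair[0], pair[1])
    fx = [i for i, f in enumerate(faces) if x in f]
    fy = [i for i, f in enumerate(faces) if y in f]
    dist = dict(nx.all_pairs_shortest_path_length(D))
    return min(dist[i][j] for i in fx for j in fy)

# two interior vertices of a 5x5 grid: every (x, y)-arc crosses >= 2 edges
G = nx.grid_2d_graph(5, 5)
print(dual_distance(G, (1, 1), (3, 3)))            # 2
```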
Theorem 2.1. For every integer k there exists a 5-connected planar graph Q_k with two vertices x, y such that d*(x, y) = k + 2, while cr(Q_k + xy) ≤ 11.

Proof. Let H_k be the planar graph that is obtained from the icosahedron by replacing all of its triangles, except one, with the dissection of the equilateral triangle with side of length k into equilateral triangles with sides of unit length (as shown in Figure 1 for k = 8). This graph is a near-triangulation: all its faces are triangles except one, whose length is 3k. We may assume that this is the outer face in a plane embedding of H_k. Its boundary is composed of three paths A, B, C of length k joining the original vertices a', b', c' of the icosahedron we started with. Now we add three new vertices a, b, c and join a with all vertices on A, b with B, and c with C. This gives rise to a 5-connected near-triangulation G_k whose outer face is the 6-gon a a' b b' c c'.

Figure 1: Part of the triangular lattice with side length 8.

Let us take 5 copies of the graph G_k and let a_i, a'_i, b_i, b'_i, c_i, c'_i be the copies of the corresponding vertices on the outer face of the ith copy of G_k, i = 1, ..., 5. Let Q_k be the planar graph obtained from these copies by cyclically identifying b_i with a_{i+1}, adding the edges b_i c_{i+1} (i = 1, ..., 5, indices modulo 5), and adding two vertices x and y such that x is joined to a_1, ..., a_5 and y is joined to c_1, ..., c_5. See Figure 2. The obtained graph Q_k is planar, and it is not difficult to verify that it is 5-connected. It is easy to see that d*(x, y) = k + 2 in Q_k. By putting the vertex x close to y, so that we can draw the edge xy without introducing crossings with other edges, and then redrawing the edges from x to its neighbors as shown in Figure 2, a drawing of Q_k + xy is obtained whose crossing number is 11. □

Figure 2: The graph Q_k.

The construction of Theorem 2.1 can be generalized such that a similar redrawing as made above for x is necessary also for y (in order to bring these two vertices close together). Such an example is shown in Figure 3, where x and y are the vertices in the centers of the small circular grids in the picture, and where the bold lines represent a "thick" barrier similar to the one used in the graph Q_k in Figure 2.

Figure 3: A planar graph for which two flips are needed.

In Figure 4, an optimum drawing of G_0 + xy is shown, where the edge xy is represented by the broken line. In this drawing, neighborhoods of x and y are redrawn inside the faces denoted by A and B (respectively) in Figure 3.

At first sight the redrawing described in the above example seems like the worst possibility which may happen: to "flip" a part of the graph containing x and to "flip" a part containing y. If this were the only possibility of making the crossing number smaller than the one coming from the planar drawing of G_0, it would most likely give rise to a polynomial-time algorithm for computing the crossing number of graphs that are just one edge away from a 3-connected planar graph.
Unfortunately, some more complicated examples show that there are other ways of shortcutting the dual distance from x to y. (Such an example was produced in a discussion with Thomas Böhme and Neil Robertson, whose help is greatly acknowledged.) Despite such examples, the following question may still have a positive answer:

Problem 2.2. Is there a polynomial time algorithm which would determine the crossing number of G0 + xy if G0 is planar?

Figure 4: An optimum drawing of G0 + xy

Acknowledgement

Supported in part by the Ministry of Higher Education, Science and Technology of Slovenia, Research Program P1-0507-0101 and Research Project L1-5014-0101.

References

[1] S.N. Bhatt, F.T. Leighton, A framework for solving VLSI graph layout problems, J. Comput. System Sci. 28 (1984) 300-343.
[2] F.T. Leighton, Complexity Issues in VLSI, MIT Press, Cambridge, Mass., 1983.
[3] F.T. Leighton, New lower bound techniques for VLSI, Math. Systems Theory 17 (1984) 47-70.
[4] A. Liebers, Planarizing graphs—a survey and annotated bibliography, J. Graph Algorithms Appl. 5 (2001) 74 pp.
[5] B. Mohar, Crossing number of almost planar graphs, preprint, 2005.
[6] B. Mohar and C. Thomassen, Graphs on Surfaces, Johns Hopkins University Press, Baltimore, 2001.
[7] A. Riskin, The crossing number of a cubic plane polyhedral map plus an edge, Studia Sci. Math. Hungar. 31 (1996) 405-413.
[8] F. Shahrokhi, O. Sýkora, L.A. Székely, I. Vrt'o, Crossing numbers: bounds and applications, in: Intuitive Geometry (Budapest, 1995), 179-206, J. Bolyai Math. Soc., Budapest, 1997.
[9] L.A. Székely, A successful concept for measuring non-planarity of graphs: the crossing number, Discrete Math. 276 (2004) 331-352.
[10] I. Vrt'o, Crossing number of graphs: A bibliography. ftp://ftp.ifi.savba.sk/pub/imrich/crobib.pdf

Unsupervised Feature Extraction for Time Series Clustering Using Orthogonal Wavelet Transform

Hui Zhang and Tu Bao Ho
School of Knowledge Science, Japan Advanced Institute of Science and Technology, Asahidai, Nomi, Ishikawa 923-1292, Japan
E-mail: {zhang-h,bao}@jaist.ac.jp

Yang Zhang
Department of Avionics, Chengdu Aircraft Design and Research Institute, No. 89 Wuhouci Street, Chengdu, Sichuan 610041, P.R. China
E-mail: v104@sohu.com

Mao-Song Lin
School of Computer Science, Southwest University of Science and Technology, Mianyang, Sichuan 621002, P.R. China
E-mail: lms@swust.edu.cn

Keywords: time series, data mining, feature extraction, clustering, wavelet

Received: September 4, 2005

Time series clustering has attracted increasing interest in the last decade, particularly for long time series such as those arising in the bioinformatics and financial domains. The widely known curse of dimensionality problem indicates that high dimensionality not only slows the clustering process, but also degrades it. Many feature extraction techniques have been proposed to attack this problem and have shown that the performance and speed of the mining algorithm can be improved at several feature dimensions. However, how to choose the appropriate dimension is a challenging task, especially for clustering problems in the absence of data labels, and it has not been well studied in the literature. In this paper we propose an unsupervised feature extraction algorithm using the orthogonal wavelet transform for automatically choosing the dimensionality of features. The feature extraction algorithm selects the feature dimensionality by leveraging two conflicting requirements, i.e., lower dimensionality and a lower sum of squared errors between the features and the original time series. The proposed feature extraction algorithm is efficient, with time complexity O(mn) when using the Haar wavelet. Encouraging experimental results are obtained on several synthetic and real-world time series datasets.

Povzetek: The article analyzes the importance of attributes in time series clustering.

1 Introduction

Time series data exist widely in various domains, such as finance, gene expression, medicine and science. Recently there has been an increasing interest in mining this sort of data. Clustering is one of the most frequently used data mining techniques; it is an unsupervised learning process for partitioning a dataset into sub-groups so that the instances within a group are similar to each other and are very dissimilar to the instances of other groups. Time series clustering has been successfully applied to various domains such as stock market value analysis and gene function prediction [17, 22]. When handling long time series, the time required to perform the clustering algorithm becomes expensive. Moreover, the curse of dimensionality, which affects any problem in high dimensions, causes highly biased estimates [5]. Clustering algorithms depend on a meaningful distance measure to group data that are close to each other and separate them from others that are far away. But in high dimensional spaces the contrast between the nearest and the farthest neighbor gets increasingly smaller, making it difficult to find meaningful groups [6]. Thus high dimensionality normally decreases the performance of clustering algorithms.

Data dimensionality reduction aims at mapping high-dimensional patterns onto lower-dimensional patterns. Techniques for dimensionality reduction can be classified into two groups: feature extraction and feature selection [34]. Feature selection is a process that selects a subset of the original attributes: the attributes that are important for maintaining the concepts in the original data are selected from the entire attribute set. Feature extraction techniques extract a set of new features from the original attributes through some functional mapping [43]. For time series data, the extracted features can be ordered in importance by using a suitable mapping function. Thus feature extraction is much more popular than feature selection in the time series mining community.

Many feature extraction algorithms have been proposed for time series mining, such as Singular Value Decomposition (SVD), the Discrete Fourier Transform (DFT), and the Discrete Wavelet Transform (DWT). Among the proposed feature extraction techniques, SVD is the most effective algorithm in terms of minimal reconstruction error. The entire time-series dataset is transformed into an orthogonal feature space in which the variables are pairwise orthogonal. The time-series dataset can then be approximated by a low-rank approximation matrix obtained by discarding the variables with lower energy. Korn et al. have successfully applied SVD to time-series indexing [31]. It is well known that SVD is computationally expensive, with time complexity O(mn^2), where m is the number of time series in a dataset and n is the length of each time series in the dataset.

DWT and DFT are powerful signal processing techniques, and both of them have fast computational algorithms. DFT maps the time series data from the time domain to the frequency domain, and there exists a fast algorithm called the Fast Fourier Transform (FFT) that can compute the DFT coefficients in O(mn log n) time. DFT has been widely used in time series indexing [4, 37, 42]. Unlike DFT, which takes the original time series from the time domain and transforms it into the frequency domain, DWT transforms the time series from the time domain into the time-frequency domain. Since the wavelet transform has the property of time-frequency localization, most of the energy of a time series can be represented by only a few wavelet coefficients. Moreover, if we use a special type of wavelet called the Haar wavelet, we can achieve O(mn) time complexity, which is much more efficient than DFT. Chan and Fu used the Haar wavelet for time-series classification, and showed performance improvement over DFT [9]. Popivanov and Miller proposed an algorithm using the Daubechies wavelet for time series classification [36].
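To make the O(n) cost per series concrete, the following is a minimal Python/NumPy sketch (not the authors' code) of the Haar pyramid algorithm: each level averages and differences adjacent pairs of values, so a full decomposition of a length-n series costs about 2(n - 1) operations. The normalization by sqrt(2) keeps the transform orthogonal (energy preserving), which the next section relies on.

    import numpy as np

    def haar_decompose(x):
        """Full Haar pyramid: returns (scale-0 approximation, details).

        x must have length 2**J. details[0] holds the finest-scale
        coefficients (scale J-1), details[-1] the coarsest (scale 0).
        """
        x = np.asarray(x, dtype=float)
        details = []
        while len(x) > 1:
            d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail coefficients
            x = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # approximation coefficients
            details.append(d)
        return x, details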
Many other time series dimensionality reduction techniques have also been proposed in recent years, such as Piecewise Linear Representation [28], Piecewise Aggregate Approximation [25, 45], Regression Trees [18], and Symbolic Representation [32]. These feature extraction algorithms keep the features with lower reconstruction error, and the feature dimensionality is decided by a user-given approximation error. All the proposed algorithms work well for time series at some feature dimensions, because the high correlation among time series data makes it possible to remove a huge amount of redundant information. Moreover, since time series data are normally contaminated by noise, one byproduct of dimensionality reduction is noise shrinkage, which can improve the mining quality.

However, how to choose the appropriate dimension of the features is a challenging problem. When using feature extraction for classification with labeled data, this problem can be circumvented by the wrapper approach. The wrapper approach uses the accuracy of the classification algorithm as the evaluation criterion: it searches for features better suited to the classification algorithm, aiming to improve classification accuracy [30]. For clustering algorithms with unlabeled data, determining the feature dimensionality becomes more difficult. To our knowledge, automatically determining the appropriate feature dimensionality has not been well studied in the literature; most of the proposed feature extraction algorithms need the user to decide the dimensionality or to give the approximation error. Zhang et al. [46] proposed an algorithm to automatically extract features from wavelet coefficients using entropy. Nevertheless, the length of the extracted features is the same as the length of the original time series, so it cannot take advantage of dimensionality reduction. Lin et al. [33] proposed an iterative clustering algorithm exploiting the multi-scale property of wavelets. The clustering centers at each approximation level are initialized with the final centers returned from the coarser representation. The algorithm can be stopped at any level, but the stopping level must be decided by the user. Several feature selection techniques for clustering have also been proposed [12, 15, 41]. However, these techniques only order the features in the absence of data labels; the appropriate dimensionality of the features still needs to be given by the user.

In this paper we propose a time-series feature extraction algorithm using orthogonal wavelets that chooses the feature dimensionality for clustering automatically. The problem of determining the feature dimensionality is circumvented by choosing the appropriate scale of the wavelet transform. An ideal feature extraction technique has the ability to efficiently reduce the data into a lower-dimensional model while preserving the properties of the original data. In practice, however, information is lost as the dimensionality is reduced. It is therefore desirable to formulate a method that reduces the dimensionality efficiently while preserving as much information from the original data as possible. The proposed feature extraction algorithm balances lower dimensionality against lower error by selecting the scale within which the detail coefficients have lower energy than those within the nearest lower scale. The proposed feature extraction algorithm is efficient and achieves time complexity O(mn) with the Haar wavelet. The rest of this paper is organized as follows.
Section 2 gives the basis for supporting our feature extraction algorithm. The feature extraction algorithm and its time complexity analysis are introduced in Section 3. Section 4 contains a comprehensive experimental evaluation of the proposed algorithm. We conclude the paper by summarizing the main contributions in Section 5.

2 The basis of the wavelet-based feature extraction algorithm

Section 2.1 briefly introduces the basic concepts of the wavelet transform. The properties of the wavelet transform supporting our feature extraction algorithm are given in Section 2.2. Section 2.3 presents the Haar wavelet transform algorithm used in our experiments.

2.1 Orthogonal Wavelet Transform Background

The wavelet transform is a domain transform technique for hierarchically decomposing sequences. It allows a sequence to be described in terms of an approximation of the original sequence, plus a set of details that range from coarse to fine. The property of wavelets is that the broad trend of the input sequence is preserved in the approximation part, while the localized changes are kept in the detail parts. No information is gained or lost during the decomposition process: the original signal can be fully reconstructed from the approximation part and the detail parts. A detailed description of the wavelet transform can be found in [13, 10].

A wavelet is a smooth and quickly vanishing oscillating function with good localization in both frequency and time. A wavelet family is the set of functions generated by dilations and translations of a unique mother wavelet ψ(t),

    ψ_{j,k}(t) = 2^{j/2} ψ(2^j t - k),  j, k ∈ Z.

A function ψ ∈ L^2(R) is an orthogonal wavelet if this family is an orthogonal basis of L^2(R), that is,

    ⟨ψ_{j,k}, ψ_{l,m}⟩ = δ_{j,l} · δ_{k,m},  j, k, l, m ∈ Z,

where ⟨·,·⟩ is the inner product and δ_{i,j} is the Kronecker delta, defined by δ_{i,j} = 0 for i ≠ j and δ_{i,j} = 1 for i = j. Any function f(t) ∈ L^2(R) can be represented in terms of this orthogonal basis as

    f(t) = Σ_{j,k} c_{j,k} ψ_{j,k}(t),   (1)

and the c_{j,k} = ⟨ψ_{j,k}(t), f(t)⟩ are called the wavelet coefficients of f(t). Parseval's theorem states that the energy is preserved under the orthogonal wavelet transform, that is,

    Σ_{j,k ∈ Z} |⟨f(t), ψ_{j,k}⟩|^2 = ‖f(t)‖^2,  f(t) ∈ L^2(R)   (2)

(Chui 1992, p. 226 [10]). Applied to the difference of two functions, Parseval's theorem also indicates that the Euclidean distance does not change under the orthogonal wavelet transform. This distance-preserving property makes sure that no false dismissal will occur with distance-based learning algorithms [29].

To efficiently calculate the wavelet transform for signal processing, Mallat introduced Multiresolution Analysis (MRA) and designed a family of fast algorithms based on it [35]. The advantage of MRA is that a signal can be viewed as composed of a smooth background and fluctuations or details on top of it. The distinction between the smooth part and the details is determined by the resolution, that is, by the scale below which the details of a signal cannot be discerned. At a given resolution, a signal is approximated by ignoring all fluctuations below that scale. We can progressively increase the resolution; at each stage of the increase in resolution, finer details are added to the coarser description, providing a successively better approximation to the signal.

An MRA of L^2(R) is a chain of subspaces {V_j : j ∈ Z} satisfying the following conditions [35]:

(i) ... ⊂ V_{-2} ⊂ V_{-1} ⊂ V_0 ⊂ V_1 ⊂ ... ⊂ L^2(R);
(ii) ∩_{j ∈ Z} V_j = {0}, and ∪_{j ∈ Z} V_j is dense in L^2(R);
(iii) f(t) ∈ V_j ⟺ f(2t) ∈ V_{j+1}, for all j ∈ Z;
(iv) there exists φ(t), called the scaling function, such that {φ(t - k) : k ∈ Z} is an orthogonal basis of V_0.

Thus φ_{j,k}(t) = 2^{j/2} φ(2^j t - k) is the orthogonal basis of V_j. Consider the space W_{j-1}, the orthogonal complement of V_{j-1} in V_j: V_j = V_{j-1} ⊕ W_{j-1}. The functions ψ_{j,k} form the orthogonal basis of W_j, and the basis {ψ_{j,k} : j ∈ Z, k ∈ Z} spans the space

    V_J = V_j ⊕ W_j ⊕ W_{j+1} ⊕ ... ⊕ W_{J-1},  j < J.   (3)

Notice that because W_{j-1} is orthogonal to V_{j-1}, the wavelet ψ is orthogonal to the scaling function φ. For a given signal f(t) ∈ L^2(R) one can find a scale J such that f_J ∈ V_J approximates f(t) up to a predefined precision. If d_{J-1} ∈ W_{J-1} and f_{J-1} ∈ V_{J-1}, then f_J is decomposed into {f_{J-1}, d_{J-1}}, where f_{J-1} is the approximation part of f_J at scale J - 1 and d_{J-1} is the detail part of f_J at scale J - 1. The wavelet decomposition can be repeated down to scale 0. Thus f_J can be represented as a series {f_0, d_0, d_1, ..., d_{J-1}} at scale 0.

2.2 The Properties of Orthogonal Wavelets for Supporting the Feature Extraction Algorithm

Assume a time series X (X ∈ R^n) is located at scale J. After decomposing X to a specific scale j (j ∈ [0, 1, ..., J-1]), the coefficients H_j(X) corresponding to scale j can be represented by a series {A_j, D_j, ..., D_{J-1}}. The A_j are called approximation coefficients, which are the projection of X onto V_j, and the D_j, ..., D_{J-1} are the wavelet coefficients in W_j, ..., W_{J-1}, representing the detail information of X.

From a signal processing point of view, the approximation coefficients within lower scales correspond to the lower frequency part of the signal. As noise often exists in the high frequency part of the signal, the first few coefficients of H_0(X), corresponding to the low frequency part of the signal, can be viewed as a noise-reduced signal. Thus keeping these coefficients will not lose much information from the original time series X. Hence normally the first k coefficients of H_0(X) are chosen as the features [36, 9]. We instead keep all the approximation coefficients within a specific scale j as the features, i.e., the projection of X onto V_j. Note that these features retain the entire information of X at a particular level of granularity. The task of choosing the first few wavelet coefficients is circumvented by choosing a particular scale: the candidate set of feature dimensions is reduced from [1, 2, ..., n] to [2^0, 2^1, ..., 2^{J-1}].

Definition 2.1. Given a time series X ∈ R^n, the features are the Haar wavelet approximation coefficients A_j decomposed from X within a specific scale j, j ∈ [0, 1, ..., J-1].

The extracted features should be similar to the original data, so a measure for evaluating the similarity/dissimilarity between the features and the data is necessary. We use the widely used sum of squared errors (the square of the Euclidean distance) as the dissimilarity measure between a time series and its approximation.

Definition 2.2. Given a time series X ∈ R^n, let X̃ ∈ R^n denote any approximation of X. The sum of squared errors (SSE) between X and X̃ is defined as

    SSE(X, X̃) = Σ_{i=1}^{n} (x_i - x̃_i)^2.   (4)

Definition 2.3. Given a time series X ∈ R^n, the energy of X is

    E(X) = Σ_{i=1}^{n} x_i^2.   (5)

Definition 2.4. Given a time series X ∈ R^n and its features A_j ∈ R^m, the energy difference (ED) between X and A_j is

    ED(X, A_j) = E(X) - E(A_j) = Σ_{i=1}^{n} x_i^2 - Σ_{i=1}^{m} a_i^2.   (6)

An approximation X̃ of X can be reconstructed by padding zeros to the end of A_j, to make the length of the padded series the same as that of X, and performing the reconstruction algorithm with the padded series. The reconstruction algorithm is the reverse process of decomposition [35].
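A quick numeric check of Parseval's theorem (Eq. 2) and of the energy difference of Definition 2.4, using haar_decompose from the sketch in Section 1; this anticipates the equivalence derived as Eq. (7) below. A sketch under the assumption that n is a power of two; not part of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=128)                      # toy series, n = 2**7, so J = 7

    def energy(v):                                # Definition 2.3
        return float(np.sum(v ** 2))

    a0, details = haar_decompose(x)               # full pyramid down to scale 0
    coeffs = np.concatenate([a0] + details)
    assert np.isclose(energy(x), energy(coeffs))  # Parseval: energy preserved

    # A_j is the approximation after J - j averaging steps; the energy it
    # misses is exactly the energy of the discarded detail coefficients.
    j = 4
    a_j = x.copy()
    for _ in range(7 - j):
        a_j = (a_j[0::2] + a_j[1::2]) / np.sqrt(2.0)
    ed = energy(x) - energy(a_j)                  # Definition 2.4
    assert np.isclose(ed, sum(energy(d) for d in details[:7 - j]))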
An example of A_j, and of the X̃ reconstructed from A_j using the Haar wavelet transform, for a time series located at scale 7 is shown in Figure 1.

From Eq. (2) we know the wavelet transform is energy preserving, so the energy of the approximation coefficients within scale j is equal to that of their reconstructed approximation series, i.e., E(X̃) = E(A_j). As mentioned in Section 2.1, V_j = V_{j-1} ⊕ W_{j-1}, so we have E(A_j) = E(A_{j-1}) + E(D_{j-1}). When decomposing X to a scale j, from Eq. (3) we have E(X) = E(A_j) + Σ_{i=j}^{J-1} E(D_i). Therefore, the energy difference between A_j and X is the sum of the energy of the wavelet coefficients located at scale j and the scales higher than j, i.e., ED(X, A_j) = Σ_{i=j}^{J-1} E(D_i).

Since the length of the features corresponding to scale j is smaller than the length of X, we cannot calculate the SSE between X and A_j by Eq. (4) directly. One choice is to reconstruct a sequence X̃ ∈ R^n from A_j and then calculate the SSE between X and X̃. For instance, Kaewpijit et al. [23] used the correlation function of X and X̃ to measure the similarity between X and A_j. Actually, SSE(X, X̃) is the same as the energy difference between A_j and X under the orthogonal wavelet transform: here H_J(X) is {A_j, D_j, ..., D_{J-1}} and H_J(X̃) is {A_j, 0, ..., 0}. Since the Euclidean distance is also preserved under the orthogonal wavelet transform, we have SSE(X, X̃) = SSE(H_J(X), H_J(X̃)) = Σ_{i=j}^{J-1} E(D_i). Therefore, the SSE between X and X̃ is equal to the energy difference between A_j and X, that is,

    SSE(X, X̃) = ED(X, A_j).   (7)

This property makes it possible to design an efficient algorithm without reconstructing X̃.

Figure 1: An example of approximation coefficients and their reconstructed approximation series

2.3 Haar Wavelet Transform

In our experiments we use the Haar wavelet, which has the fastest transform algorithm and is the most popularly used orthogonal wavelet, proposed by Haar. Note that the properties mentioned in Section 2.2 hold for all orthogonal wavelets, such as the Daubechies wavelet family. The concrete mathematical foundation of the Haar wavelet can be found in [7]. The length of an input time series is restricted to an integer power of 2 in the process of wavelet decomposition; if the length of the input time series does not satisfy this requirement, the series is extended to an integer power of 2 by padding zeros to its end. The Haar wavelet has the mother function

    ψ(t) = 1 for 0 ≤ t < 1/2,  -1 for 1/2 ≤ t < 1,  and 0 otherwise.

3 The Feature Extraction Algorithm

Following the property above, the time series in a dataset are decomposed scale by scale, starting from the highest scale, and the total energy of the detail coefficients within each scale is compared with that within the previous (higher) scale: the decomposition stops at the first scale j for which Σ_{i=1}^{m} E(D_j) > Σ_{i=1}^{m} E(D_{j+1}). The scale j* at which this criterion is met, i.e., the scale within which the detail coefficients have lower energy than those within the nearest lower scale, is defined as the appropriate scale, and the features corresponding to the scale j* are kept as the appropriate features. Note that by this process at least D_{J-1} will be removed, and the length of D_{J-1} is n/2 for the Haar wavelet. Hence the dimensionality of the features will be smaller than or equal to n/2. The proposed feature extraction algorithm is summarized in pseudo-code format in Algorithm 1.
Algorithm 1 The feature extraction algorithm
  Input: a set of time series {X_1, X_2, ..., X_m}
  for i = 1 to m do
    calculate A_{J-1} and D_{J-1} for X_i
  end for
  calculate Σ_{i=1}^{m} E(D_{J-1})
  exitFlag = true
  for j = J-2 downto 0 do
    for i = 1 to m do
      calculate A_j and D_j for X_i
    end for
    calculate Σ_{i=1}^{m} E(D_j)
    if Σ_{i=1}^{m} E(D_j) > Σ_{i=1}^{m} E(D_{j+1}) then
      keep all the A_{j+1} as the appropriate features for each time series
      exitFlag = false
      break
    end if
  end for
  if exitFlag then
    keep all the A_0 as the appropriate features for each time series
  end if

3.2 Time Complexity Analysis

The time complexity of Haar wavelet decomposition for a single time series is 2(n - 1), bounded by O(n) [8]. Thus for a time-series dataset having m time series, the time complexity of decomposition is m · 2(n - 1). Note that the feature extraction algorithm can break the loop before reaching the lowest scale; we analyze the extreme case of the algorithm with the highest time complexity (the appropriate scale j = 0). When j = 0, the algorithm consists of the following sub-algorithms:

- Decompose each time series in the dataset down to the lowest scale, with time complexity m · 2(n - 1);
- Calculate the energy of the wavelet coefficients, with time complexity m · (n - 1);
- Compare the Σ_{i=1}^{m} E(D_j) of the different scales, with time complexity log2(n).

The time complexity of the algorithm is the sum of the time complexities of the above sub-algorithms, bounded by O(mn).
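The following is a compact runnable sketch of Algorithm 1 in Python/NumPy (an illustration under the assumption that the dataset is an m x n array with n a power of two; not the authors' code). It decomposes all series one scale at a time and stops as soon as the total detail energy grows, returning the approximation coefficients of the previous scale as the features, in O(mn) overall.

    import numpy as np

    def extract_features(dataset):
        """Algorithm 1 sketch: choose the appropriate scale for a whole
        dataset and return the Haar approximation coefficients A_{j*}."""
        a = np.asarray(dataset, dtype=float)
        # First step: scale J-1 (at least D_{J-1} is always removed).
        d = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2.0)
        a = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2.0)
        prev_energy = float(np.sum(d ** 2))
        while a.shape[1] > 1:
            d = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2.0)
            a_next = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2.0)
            det_energy = float(np.sum(d ** 2))
            if det_energy > prev_energy:   # sum_i E(D_j) > sum_i E(D_{j+1})
                return a                   # keep A_{j+1} as the features
            a, prev_energy = a_next, det_energy
        return a                           # reached scale 0: keep A_0

For example, extract_features(X) on an m x 128 array returns an m x 64 array when the algorithm stops after the first iteration, mirroring the "appropriate scale = 1" cases reported in Section 4.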
4 Experimental evaluation

We use subjective observation and five objective criteria on nine datasets to evaluate the clustering quality of the K-means and hierarchical clustering algorithms [21]. The effectiveness of the feature extraction algorithm is evaluated by comparing the clustering quality of the extracted features with the clustering quality of the original data. We also compare the clustering quality of the extracted appropriate features with that of the features located in the scale prior to the appropriate scale (prior scale) and the scale posterior to the appropriate scale (posterior scale). The efficiency of the proposed feature extraction algorithm is validated by comparing the execution time of the chain process, which performs feature extraction first and then executes clustering with the extracted features, with that of clustering with the original datasets directly.

4.1 Clustering Quality Evaluation Criteria

Evaluating clustering systems is not a trivial task, because clustering is an unsupervised learning process in the absence of information about the actual partitions. We used classified datasets and compared how well the clustered results fit the data labels, which is the most popular clustering evaluation method [20]. Five objective clustering evaluation criteria were used in our experiments: Jaccard, Rand and FM [20]; CSM, used for evaluating time series clustering algorithms [44, 24, 33]; and NMI, used recently for validating clustering results [40, 16].

Consider G = {G_1, G_2, ..., G_M} as the clusters from a supervised dataset, and A = {A_1, A_2, ..., A_M} as those obtained by a clustering algorithm under evaluation. Denote by D a dataset of original time series or features. For all pairs of series (D_i, D_j) in D, we count the following quantities:

- a is the number of pairs that belong to one cluster in G and are clustered together in A;
- b is the number of pairs that belong to one cluster in G but are not clustered together in A;
- c is the number of pairs that are clustered together in A but do not belong to one cluster in G;
- d is the number of pairs that are neither clustered together in A nor belong to the same cluster in G.

The clustering evaluation criteria used are defined as below:

1. Jaccard score (Jaccard): Jaccard = a / (a + b + c)

2. Rand statistic (Rand): Rand = (a + d) / (a + b + c + d)

3. Folkes and Mallow index (FM): FM = a / sqrt((a + b)(a + c))

4. Cluster Similarity Measure (CSM): CSM(G, A) = (1/M) Σ_{i=1}^{M} max_{1 ≤ j ≤ M} Sim(G_i, A_j)

4.2 Datasets

- Cylinder-Bell-Funnel (CBF): The three classes cylinder, bell and funnel are generated as c(t) = (6 + η)·χ_[a,b](t) + ε(t), b(t) = (6 + η)·χ_[a,b](t)·(t - a)/(b - a) + ε(t) and f(t) = (6 + η)·χ_[a,b](t)·(b - t)/(b - a) + ε(t), where χ_[a,b](t) = 1 if a ≤ t ≤ b and 0 otherwise, η and ε(t) are drawn from a standard normal distribution N(0, 1), a is an integer drawn uniformly from the range [16, 32] and b - a is an integer drawn uniformly from the range [32, 96]. The UCR Archive provides the source code for generating the samples. We generated 128 samples for each class, with length 128.
- Control Chart Time Series (CC): This dataset has 100 instances for each of the six different classes of control charts.
- Trace dataset (Trace): The 4-class dataset contains 200 instances, 50 for each class. The dimensionality of the data is 275.
- Gun Point dataset (Gun): The dataset has two classes, each containing 100 instances. The dimensionality of the data is 150.
- Reality dataset (Reality): The dataset consists of data from Space Shuttle telemetry, exchange rates and artificial sequences. The data are normalized so that the minimum value is zero and the maximum is one. Each cluster contains one time series with 1000 data points.
- ECG dataset (ECG): The ECG dataset was obtained from the ECG database at PhysioNet [19]. We used 3 groups of those ECG time series in our experiments: Group 1 includes 22 time series representing the 2 sec ECG recordings of people having malignant ventricular arrhythmia; Group 2 consists of 13 time series that are 2 sec ECG recordings of healthy people, representing the normal sinus rhythm of the heart; Group 3 includes 35 time series representing the 2 sec ECG recordings of people having supraventricular arrhythmia.
- Personal income dataset (Income): The personal income dataset [1] is a collection of time series representing the per capita personal income from 1929 to 1999 in 25 states of the USA.¹ The 25 states were partitioned into two groups based on their growth rate: group 1 includes the east coast states, CA and IL, in which the personal income grows at a high rate; the mid-west states, in which the personal income grows at a low rate, form group 2.
- Temperature dataset (Temp): This dataset was obtained from the National Climatic Data Center [2]. It is a collection of 30 time series of the daily temperature in the year 2000 in various places in Florida, Tennessee and Cuba: 10 places in Tennessee, 5 places in Northern Florida, 9 places in Southern Florida and 6 places in Cuba. The dataset is grouped based on geographical distance and similar temperature trends: Tennessee and Northern Florida form group 1; Cuba and Southern Florida form group 2.
- Population dataset (Popu): The population dataset is a collection of time series representing the population estimates from 1900 to 1999 in 20 states of the USA [3]. The 20 states are partitioned into two groups based on their trends: group 1 consists of CA, CO, FL, GA, MD, NC, SC, TN, TX, VA, and WA, having an exponentially increasing trend, while group 2 consists of IL, MA, MI, NJ, NY, OK, PA, ND, and SD, having a stabilizing trend.

¹The 25 states included were: CT, DC, DE, FL, MA, ME, MD, NC, NJ, NY, PA, RI, VA, VT, WV, CA, IL, ID, IA, IN, KS, ND, NE, OK, SD.

As Gavrilov et al. [17] showed experimentally that normalization is suitable for time series clustering, each time series in the datasets downloaded from the Internet (ECG, Income, Temp, and Popu) was normalized by the formula x_i = (x_i - μ)/σ, i ∈ [1, 2, ..., n].
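Before turning to the results, a small sketch of how the pair-based criteria of Section 4.1 can be computed from two labelings (an illustration, not the authors' evaluation code; the O(n^2) pair enumeration is fine at these dataset sizes):

    import numpy as np
    from itertools import combinations

    def pair_counts(labels_true, labels_pred):
        """Count the quantities a, b, c, d defined in Section 4.1."""
        a = b = c = d = 0
        for i, j in combinations(range(len(labels_true)), 2):
            same_g = labels_true[i] == labels_true[j]   # together in G
            same_a = labels_pred[i] == labels_pred[j]   # together in A
            if same_g and same_a:
                a += 1
            elif same_g:
                b += 1
            elif same_a:
                c += 1
            else:
                d += 1
        return a, b, c, d

    def jaccard(a, b, c, d):
        return a / (a + b + c)

    def rand(a, b, c, d):
        return (a + d) / (a + b + c + d)

    def fm(a, b, c, d):
        return a / np.sqrt((a + b) * (a + c))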
4.3 The Clustering Performance Evaluation

We used the widely used Euclidean distance for the K-means and hierarchical clustering algorithms. As the Reality dataset has only one time series in each cluster, which is not suitable for the K-means algorithm, it was only used for hierarchical clustering. Since the clustering results of K-means depend on the initial clustering centers, which should be randomly initialized in each run, we ran K-means 100 times with randomly initialized centers for each experiment. Section 4.3.1 gives the energy ratio of the wavelet coefficients within the various scales and the calculated appropriate scale for each dataset used. The evaluation of the K-means clustering algorithm with the proposed feature extraction algorithm is given in Section 4.3.2. Section 4.3.3 describes the comparative evaluation of hierarchical clustering with the feature extraction algorithm.

4.3.1 The Energy Ratio and Appropriate Scale

Table 1 provides the energy ratio Σ_{i=1}^{m} E(D_j) / Σ_{i=1}^{m} Σ_{j=0}^{J-1} E(D_j) (in proportion to the total detail energy) of the wavelet coefficients within the various scales for all the datasets used. The calculated appropriate scales for the nine datasets, using Algorithm 1, are shown in Table 2. The algorithm stops after the first iteration (scale = 1) for most of the datasets (Trace, Gun, Reality, ECG, Popu, and Temp), and stops after the second iteration (scale = 2) for the CBF and CC datasets. The algorithm stops after the third iteration (scale = 3) only for the Income dataset. If the sampling frequency of a time series is f, wavelet coefficients within scale j correspond to the information with frequency f/2^j. Table 2 shows that most of the time series datasets used have important frequency components beginning from f/2 or f/4.

4.3.2 K-means Clustering with and without Feature Extraction

The average execution time of the chain process that first executes the feature extraction algorithm and then performs K-means with the extracted features (termed FE + K-means), over 100 runs, and that of performing K-means directly on the original data (termed K-means), over 100 runs, are illustrated in Figure 2. The chain process executes faster than K-means with the original data for all eight datasets used. Table 3 describes the mean of the evaluation criteria values of 100 runs for K-means with the original data. Table 4 gives the mean of the evaluation criteria values of 100 runs for K-means with the extracted features.

Figure 2: The average execution time (s) of the FE + K-means and K-means algorithms for eight datasets

To compare the difference between the mean obtained from 100 runs of K-means with the extracted features and that obtained from 100 runs of K-means with the corresponding original data, a two-sample Z-test or a two-sample t-test can be used. We prefer the two-sample t-test because it is robust with respect to violation of the assumption of equal population variances, provided that the sample sizes are equal [39]. We use the two-sample t-test with the following hypotheses:

    H0: μ1 = μ2
    H1: μ1 ≠ μ2

where μ1 is the mean of the evaluation criteria values corresponding to the original datasets and μ2 is that corresponding to the extracted features. The significance level is set to 0.05. When the null hypothesis (H0) is rejected, we conclude that the data provide strong evidence that μ1 differs from μ2, and which of the two is bigger can easily be determined by comparing the corresponding mean values as shown in Table 3 and Table 4.
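A sketch of how such a comparison could be scripted with SciPy's standard two-sample t-test; the arrays here are placeholder values, not the paper's measurements:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    crit_orig = rng.normal(0.35, 0.02, size=100)   # hypothetical criterion values, original data
    crit_feat = rng.normal(0.36, 0.02, size=100)   # hypothetical criterion values, extracted features

    t, p = stats.ttest_ind(crit_orig, crit_feat)   # equal-variance two-sample t-test
    if p < 0.05:                                   # significance level 0.05: reject H0
        verdict = '>' if crit_feat.mean() > crit_orig.mean() else '<'
    else:
        verdict = '='                              # H0 not rejected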
We list the results of the t-tests in Table 5. If the mean of the values of a criterion corresponding to the extracted features is significantly bigger than that corresponding to the original data, we set the character to '>'; if it is significantly smaller, the character is set to '<'; otherwise we set the character to '='. Table 5 shows that, over the eight datasets, the evaluation criteria values corresponding to the extracted features are bigger than those corresponding to the original data eleven times, smaller five times, and equal twenty-four times. Based on the above analysis, we can conclude that the quality of the K-means algorithm with extracted features is, on average, better than that with the original data for the datasets used. Table 6 gives the mean of the evaluation criteria values of 100 runs of K-means with the features in the prior scale.

Table 1: The energy ratio (%) of the wavelet coefficients within various scales for all the used datasets

    scale  CBF    CC     Trace  Gun    Reality  ECG    Income  Popu   Temp
    1      8.66   6.34   0.54   0.18   0.03     18.22  5.07    0.12   4.51
    2      6.00   5.65   1.31   1.11   0.10     26.70  2.03    0.46   7.72
    3      7.67   40.42  2.60   2.20   0.29     19.66  1.49    7.31   5.60
    4      11.48  19.73  4.45   7.85   3.13     12.15  26.08   8.10   4.57
    5      18.97  9.98   6.75   15.58  3.85     8.97   28.49   13.68  9.92
    6      32.25  17.87  14.66  54.81  8.94     7.11   26.39   21.94  4.29
    7      15.62         39.66  14.43  21.39    3.55   10.46   48.39  16.60
    8                    29.56  4.02   20.01    1.80                  42.62
    9                    0.46          19.41    1.83                  4.16
    10                                 22.84

Table 2: The appropriate scales of all nine datasets

           CBF  CC  Trace  Gun  Reality  ECG  Income  Popu  Temp
    scale  2    2   1      1    1        1    3       1     1

The difference between the mean of the criteria values produced by the K-means algorithm with the extracted features and that of the criteria values generated with the features in the prior scale, validated by t-test, is described in Table 7. The mean criteria values corresponding to the extracted features are twelve times bigger than, nineteen times equal to, and nine times smaller than those corresponding to the features located in the prior scale. The mean of the evaluation criteria values of 100 runs of K-means with the features in the posterior scale is shown in Table 8. Table 9 provides the t-test result for the difference between the clustering criteria values of the extracted features and those produced by the features within the posterior scale. The mean criteria values corresponding to the extracted features are ten times bigger than, twenty-nine times equal to, and only one time smaller than those corresponding to the features located in the posterior scale. Based on the results of the hypothesis testing, we can conclude that the quality of the K-means algorithm with the extracted appropriate features is, on average, better than that with the features in the prior scale and the posterior scale for the datasets used.

4.3.3 Hierarchical Clustering with and without Feature Extraction

We used single linkage for the hierarchical clustering algorithm in our experiments. Figure 3 provides a comparison of the execution time of performing the hierarchical clustering algorithm with the original data (termed HC) and the chain process of feature extraction plus hierarchical clustering (termed HC + FE).
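A sketch of how the HC vs. HC + FE timing comparison could be reproduced, using extract_features from the Section 3 sketch and SciPy's single-linkage clustering (a toy dataset stands in for the real ones):

    import time
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(2).normal(size=(100, 128))   # toy dataset, m = 100, n = 128

    t0 = time.perf_counter()
    linkage(X, method='single', metric='euclidean')        # HC on the raw series
    t_hc = time.perf_counter() - t0

    t0 = time.perf_counter()
    feats = extract_features(X)                            # feature extraction first,
    linkage(feats, method='single', metric='euclidean')    # then HC on the features
    t_fe = time.perf_counter() - t0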
For clearly observing the difference between the execution times of HC and HC + FE, the execution times are also given in Table 10. The chain process executes faster than hierarchical clustering with the original data for all nine datasets.

Figure 3: The execution time (s) of the HC + FE and HC algorithms

Evaluating the quality of the hierarchical clustering algorithm can be done in a subjective way and in an objective way. Dendrograms are good for subjectively evaluating hierarchical clustering algorithms with time series data [27]. As only the Reality dataset has one time series in each cluster, which is suitable for visual observation, we used it for subjective evaluation; the other datasets are evaluated by the objective criteria. Hierarchical clustering with the Reality dataset and with its extracted features produced the same clustering solution. Note that this result fits the description of the dataset, as Euclidean distance produces the intuitively correct clustering [26]. The dendrogram of the clustering solution is shown in Figure 4.

Table 3: The mean of the evaluation criteria values obtained from 100 runs of the K-means algorithm with eight datasets

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3490  0.4444  0.3592  0.3289  0.2048  0.6127  0.7142  0.7801
    Rand     0.6438  0.8529  0.7501  0.4975  0.5553  0.7350  0.7611  0.8358
    FM       0.5201  0.6213  0.5306  0.4949  0.3398  0.7611  0.8145  0.8543
    CSM      0.5830  0.6737  0.5536  0.5000  0.4240  0.8288  0.8211  0.8800
    NMI      0.3450  0.7041  0.5189  0.0000  0.0325  0.4258  0.5946  0.6933

Table 4: The mean of the evaluation criteria values obtained from 100 runs of the K-means algorithm with the extracted features

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3439  0.4428  0.3672  0.3289  0.2644  0.6344  0.7719  0.8320
    Rand     0.6447  0.8514  0.7498  0.4975  0.4919  0.7644  0.8079  0.8758
    FM       0.5138  0.6203  0.5400  0.4949  0.4314  0.7770  0.8522  0.8912
    CSM      0.5751  0.6681  0.5537  0.5000  0.4526  0.8579  0.8562  0.9117
    NMI      0.3459  0.6952  0.5187  0.0000  0.0547  0.4966  0.6441  0.7832

Figure 4: The dendrogram of hierarchical clustering with the extracted features and that with the original data for the Reality dataset

As each run of hierarchical clustering on the same dataset always gives the same result, multiple runs are not needed, and the criteria values obtained from the extracted features are compared to those obtained from the original data directly, without hypothesis testing. Table 11 describes the evaluation criteria values produced by hierarchical clustering with the eight original datasets. Table 12 gives the evaluation criteria values obtained from hierarchical clustering with the extracted features. The difference between the items in Table 11 and Table 12 is provided in Table 13. The meaning of the characters in Table 13 is as follows: '>' means a criterion value produced by the extracted features is bigger than that produced by the original data; '<' denotes a criterion value obtained from the extracted features that is smaller than that obtained from the original data; otherwise, we set the character to '='. Hierarchical clustering with the extracted features produces the same result as clustering with the original data on the CBF, Trace, Gun, Popu and Temp datasets. For the other three datasets, the evaluation criteria values produced by hierarchical clustering with the extracted features are ten times bigger than, four times smaller than, and one time equal to those obtained from hierarchical clustering with the original data.
From the experimental results, we can conclude that the quality of hierarchical clustering with the extracted features is, on average, better than that with the original data for the datasets used. The criteria values produced by the hierarchical clustering algorithm with the features in the prior scale are given in Table 14. The criteria values corresponding to the extracted features shown in Table 12 are nine times bigger than, five times smaller than, and twenty-six times equal to the criteria values corresponding to the features in the prior scale. Table 15 shows the criteria values obtained by the hierarchical clustering algorithm with the features in the posterior scale. The criteria values produced by the extracted features given in Table 12 are eleven times bigger than, four times smaller than, and twenty-five times equal to the criteria values corresponding to the features in the posterior scale. From the experimental results, we can conclude that the quality of hierarchical clustering with the extracted features is, on average, better than that of hierarchical clustering with the features located in the prior and posterior scales for the datasets used.

Table 5: The difference between the mean of criteria values produced by the K-means algorithm with extracted features and with original datasets, validated by t-test

             CBF  CC  Trace  Gun  ECG  Income  Popu  Temp
    Jaccard  <    =   =      =    >    >       =     =
    Rand     >    =   =      =    <    >       =     =
    FM       <    =   =      =    >    >       =     =
    CSM      <    =   =      =    >    >       =     =
    NMI      =    <   =      =    >    >       =     >

Table 6: The mean of the evaluation criteria values obtained from 100 runs of the K-means algorithm with features in the prior scale

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3489  0.4531  0.3592  0.3289  0.2048  0.6138  0.7142  0.7801
    Rand     0.6438  0.8557  0.7501  0.4975  0.5553  0.7376  0.7611  0.8358
    FM       0.5200  0.6299  0.5306  0.4949  0.3398  0.7615  0.8145  0.8543
    CSM      0.5829  0.6790  0.5536  0.5000  0.4240  0.8310  0.8211  0.8800
    NMI      0.3439  0.7066  0.5189  0.0000  0.0325  0.4433  0.5946  0.6933

Table 7: The difference between the mean of criteria values produced by the K-means algorithm with extracted features and with features in the prior scale, validated by t-test

             CBF  CC  Trace  Gun  ECG  Income  Popu  Temp
    Jaccard  <    <   =      =    >    >       =     =
    Rand     >    <   =      =    <    >       =     =
    FM       <    <   =      =    >    >       =     =
    CSM      <    <   =      =    >    >       =     =
    NMI      >    <   =      =    >    >       =     >

Table 8: The mean of the evaluation criteria values obtained from 100 runs of the K-means algorithm with features in the posterior scale

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3457  0.4337  0.3632  0.3289  0.2688  0.4112  0.7770  0.8507
    Rand     0.6455  0.8482  0.7501  0.4975  0.4890  0.5298  0.8141  0.8906
    FM       0.5158  0.6114  0.5352  0.4949  0.4388  0.5826  0.8560  0.9031
    CSM      0.5771  0.6609  0.5545  0.5000  0.4663  0.6205  0.8611  0.9216
    NMI      0.3474  0.6868  0.5190  0.0000  0.0611  0.0703  0.6790  0.8037

Table 9: The difference between the mean of criteria values produced by the K-means algorithm with extracted features and with features in the posterior scale, validated by t-test

             CBF  CC  Trace  Gun  ECG  Income  Popu  Temp
    Jaccard  =    >   =      =    =    >       =     =
    Rand     =    >   =      =    =    >       =     =
    FM       =    >   =      =    =    >       =     =
    CSM      =    >   =      =    <    >       =     =
    NMI      =    >   =      =    =    >       =     =

Table 10: The execution time (s) of HC + FE and HC

             CBF      CC       Trace   Gun     ECG     Income  Popu    Temp    Reality
    HC       13.7479  51.3319  2.7102  2.1256  0.3673  0.0169  0.0136  0.0511  0.0334
    HC + FE  12.1269  49.7365  2.6423  2.0322  0.2246  0.0156  0.0133  0.0435  0.0172

Table 11: The evaluation criteria values produced by the hierarchical clustering algorithm with the original eight datasets

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3299  0.5594  0.4801  0.3289  0.3259  0.5548  0.4583  0.4497
    Rand     0.3369  0.8781  0.7488  0.4975  0.3619  0.5800  0.5211  0.4877
    FM       0.5714  0.7378  0.6827  0.4949  0.5535  0.7379  0.6504  0.6472
    CSM      0.4990  0.7540  0.6597  0.5000  0.4906  0.6334  0.6386  0.6510
    NMI      0.0366  0.8306  0.6538  0.0000  0.0517  0.1460  0.1833  0.1148

Table 12: The evaluation criteria values obtained by the hierarchical clustering algorithm with the appropriate features extracted from the eight datasets

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3299  0.5933  0.4801  0.3289  0.3355  0.5068  0.4583  0.4497
    Rand     0.3369  0.8882  0.7488  0.4975  0.3619  0.5200  0.5211  0.4877
    FM       0.5714  0.7682  0.6827  0.4949  0.5696  0.6956  0.6504  0.6472
    CSM      0.4990  0.7758  0.6597  0.5000  0.4918  0.6402  0.6386  0.6510
    NMI      0.0366  0.8525  0.6538  0.0000  0.0847  0.0487  0.1833  0.1148

Table 13: The difference between the criteria values obtained by the hierarchical clustering algorithm with the eight datasets and with the features extracted from the datasets

             CBF  CC  Trace  Gun  ECG  Income  Popu  Temp
    Jaccard  =    >   =      =    >    <       =     =
    Rand     =    >   =      =    =    <       =     =
    FM       =    >   =      =    >    <       =     =
    CSM      =    >   =      =    >    >       =     =
    NMI      =    >   =      =    >    <       =     =

Table 14: The evaluation criteria values obtained by the hierarchical clustering algorithm with features in the prior scale

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3299  0.5594  0.4801  0.3289  0.3259  0.5548  0.4583  0.4497
    Rand     0.3369  0.8781  0.7488  0.4975  0.3619  0.5800  0.5211  0.4877
    FM       0.5714  0.7378  0.6827  0.4949  0.5535  0.7379  0.6504  0.6472
    CSM      0.4990  0.7540  0.6597  0.5000  0.4906  0.6334  0.6386  0.6510
    NMI      0.0366  0.8306  0.6538  0.0000  0.0517  0.1460  0.1833  0.1148

Table 15: The evaluation criteria values obtained by the hierarchical clustering algorithm with features in the posterior scale

             CBF     CC      Trace   Gun     ECG     Income  Popu    Temp
    Jaccard  0.3299  0.4919  0.4801  0.3289  0.3355  0.5090  0.4583  0.4343
    Rand     0.3369  0.8332  0.7488  0.4975  0.3619  0.5467  0.5211  0.5123
    FM       0.5714  0.6973  0.6827  0.4949  0.5696  0.6908  0.6504  0.6207
    CSM      0.4990  0.6640  0.6597  0.5000  0.4918  0.6258  0.6386  0.6340
    NMI      0.0366  0.7676  0.6538  0.0000  0.0847  0.0145  0.1833  0.1921

5 Conclusions

In this paper, unsupervised feature extraction is carried out in order to improve the time series clustering quality and speed up the clustering process. We propose an unsupervised feature extraction algorithm for time series clustering using orthogonal wavelets. The features are defined as the approximation coefficients within a specific scale. We show that the sum of squared errors between the approximation series reconstructed from the features and the time series is equal to the energy of the wavelet coefficients within this scale and lower scales. Based on this property, we resolve the conflict between taking lower dimensionality and a lower sum of squared errors simultaneously by finding the scale within which the energy of the wavelet coefficients is lower than that within the nearest lower scale. An efficient feature extraction algorithm is designed without reconstructing the approximation series. The time complexity of the feature extraction algorithm achieves O(mn) with the Haar wavelet transform. The main benefit of the proposed feature extraction algorithm is that the dimensionality of the features is chosen automatically. We conducted experiments on nine time series datasets using the K-means and hierarchical clustering algorithms. The clustering results were evaluated by subjective observation and five objective criteria. The chain process of performing feature extraction first and then executing the clustering algorithm with the extracted features executes faster than clustering directly with the original data for all the datasets used.
The quality of clustering with the extracted features is, on average, better than that with the original data for the datasets used. The quality of clustering with the extracted appropriate features is also better than that with the features corresponding to the scales prior and posterior to the appropriate scale.

References

[1] http://www.bea.gov/bea/regional/spi.
[2] http://www.ncdc.noaa.gov/rcsg/datasets.html.
[3] http://www.census.gov/population/www/estimates/st_stts.html.
[4] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th Conference on Foundations of Data Organization and Algorithms, pages 69-84, 1993.
[5] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ, 1961.
[6] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proceedings of the 7th International Conference on Database Theory, pages 217-235, 1999.
[7] C. S. Burrus, R. A. Gopinath, and H. Guo. Introduction to Wavelets and Wavelet Transforms, A Primer. Prentice Hall, Englewood Cliffs, NJ, 1997.
[8] K. P. Chan and A. W. Fu. Efficient time series matching by wavelets. In Proceedings of the 15th International Conference on Data Engineering, pages 126-133, 1999.
[9] K. P. Chan, A. W. Fu, and T. Y. Clement. Haar wavelets for efficient similarity search of time-series: with and without time warping. IEEE Trans. on Knowledge and Data Engineering, 15(3):686-705, 2003.
[10] C. K. Chui. An Introduction to Wavelets. Academic Press, San Diego, 1992.
[11] T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Communication, New York, 1991.
[12] M. Dash, H. Liu, and J. Yao. Dimensionality reduction of unsupervised data. In Proceedings of the 9th IEEE International Conference on Tools with AI, pages 532-539, 1997.
[13] I. Daubechies. Ten Lectures on Wavelets. SIAM, Philadelphia, PA, 1992.
[14] D. L. Donoho. De-noising by soft-thresholding. IEEE Trans. on Information Theory, 41(3):613-627, 1995.
[15] J. G. Dy and C. E. Brodley. Feature subset selection and order identification for unsupervised learning. In Proceedings of the 17th International Conference on Machine Learning, pages 247-254, 2000.
[16] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the 21st International Conference on Machine Learning, 2004. Article No. 36.
[17] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market: Which measure is best? In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 487-496, 2000.
[18] P. Geurts. Pattern extraction for time series classification. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 115-127, 2001.
[19] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215-e220, 2000. http://www.physionet.org/physiobank/database/.
[20] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107-145, 2001.
[21] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2000.
[22] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Trans. on Knowledge and Data Engineering, 16(11):1370-1386, 2004.
[23] S. Kaewpijit, J. L. Moigne, and T. E. Ghazawi. Automatic reduction of hyperspectral imagery using wavelet spectral analysis. IEEE Trans. on Geoscience and Remote Sensing, 41(4):863-871, 2003.
[24] K. Kalpakis, D. Gada, and V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In Proceedings of the 2001 IEEE International Conference on Data Mining, pages 273-280, 2001.
[25] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3:263-286, 2000.
[26] E. Keogh and T. Folias. The UCR time series data mining archive. http://www.cs.ucr.edu/~eamonn/TSDMA/index.html, 2002.
[27] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349-371, 2003.
[28] E. Keogh and M. Pazzani. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pages 239-241, 1998.
[29] S. W. Kim, S. Park, and W. W. Chu. Efficient processing of similarity search under time warping in sequence databases: An index-based approach. Information Systems, 29(5):405-420, 2004.
[30] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.
[31] F. Korn, H. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 289-300, 1997.
[32] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2-11, 2003.
[33] J. Lin, M. Vlachos, E. Keogh, and D. Gunopulos. Iterative incremental clustering of time series. In Proceedings of the 9th International Conference on Extending Database Technology, pages 106-122, 2004.
[34] H. Liu and H. Motoda. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic, Boston, 1998.
[35] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, San Diego, second edition, 1999.
[36] I. Popivanov and R. J. Miller. Similarity search over time-series data using wavelets. In Proceedings of the 18th International Conference on Data Engineering, pages 212-221, 2002.
[37] D. Rafiei and A. Mendelzon. Efficient retrieval of similar time sequences using DFT. In Proceedings of the 5th International Conference on Foundations of Data Organizations, pages 249-257, 1998.
[38] N. Saito. Local Feature Extraction and Its Application Using a Library of Bases. PhD thesis, Department of Mathematics, Yale University, 1994.
[39] G. W. Snedecor and W. G. Cochran. Statistical Methods. Iowa State University Press, Ames, eighth edition, 1989.
[40] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(3):583-617, 2002.
[41] L. Talavera. Feature selection as a preprocessing step for hierarchical clustering. In Proceedings of the 16th International Conference on Machine Learning, pages 389-397, 1999.
[42] Y. Wu, D. Agrawal, and A. E. Abbadi. A comparison of DFT and DWT based similarity search in time-series databases. In Proceedings of the 9th ACM CIKM International Conference on Information and Knowledge Management, pages 488-495, 2000.
[43] N. Wyse, R. Dubes, and A. K. Jain. A critical evaluation of intrinsic dimensionality algorithms. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice, pages 415-425. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1980.
[44] Y. Xiong and D. Y. Yeung. Time series clustering with ARMA mixtures. Pattern Recognition, 37(8):1675-1689, 2004.
[45] B. K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the 26th International Conference on Very Large Databases, pages 385-394, 2000.
[46] H. Zhang, T. B. Ho, and M. S. Lin. A non-parametric wavelet feature extractor for time series classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 595-603, 2004.

Coloring Weighted Series-Parallel Graphs

Gašper Fijavž
Faculty of Computer and Information Science
E-mail: gasper.fijavz@fri.uni-lj.si

Keywords: graph coloring, circular coloring, weighted graphs

Received: December 31, 2003

Let G be a series-parallel graph with integer edge weights. A p-coloring of G is a mapping of the vertices of G into Z_p (the ring of integers modulo p) so that the distance between the colors of adjacent vertices u and v is at least the weight of the edge uv. We describe a quadratic time p-coloring algorithm where p is either twice the maximum edge weight or the largest possible sum of three weights of edges lying on a common cycle.

Povzetek: A coloring of graphs is described.

1 Introduction

The motivation for the problem is twofold. An instance of coloring edge-weighted graphs is the channel assignment problem, cf. [4]. On the other hand, traditional vertex coloring of (unweighted) graphs can be viewed as a circular one—consider the colors to lie in an appropriate ring of integer residues. Circular colorings of graphs (see [8] for a comprehensive survey), where we allow the vertices to be colored by real numbers (modulo p), model several optimization problems better than traditional colorings of graphs. The circular chromatic number, the minimum p for which a circular coloring exists, is a refinement of the chromatic number of a graph, and is similarly NP-hard to compute.

If the largest complete minor in (an unweighted graph) G has k vertices and k ≤ 6, then the valid cases of the Hadwiger conjecture imply χ(G) ≤ k, see [7]. Let G = (V, E, w) be a weighted graph (where (V, E) is the underlying unweighted graph) with edge weights w (and w : E → [1, ∞)). We can, similarly as in the unweighted case, define the size of the largest complete minor, see [5, 6, 3]: the size of the largest weighted K2-minor in G is twice the maximal edge weight, and for the size of the largest weighted complete K3-minor we also have to consider the biggest possible sum of weights of three edges lying on a common cycle. If G is a series-parallel graph, then the largest of the above-mentioned quantities is called the weighted Hadwiger number of G, which we denote by h(G). The weighted case of the Hadwiger conjecture is valid only for graphs satisfying h(G) ≤ 4, i.e., it is true that if h(G) ≤ 4, then the weighted chromatic number of G, which we denote by χ_w(G), is at most h(G) [3]. If a weighted graph G is not series-parallel, then it may occur that χ_w(G) > h(G), see [3] for examples. Hence, for series-parallel weighted graphs, h(G) is a natural upper bound for χ_w(G).
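To make the circular coloring condition concrete, here is a minimal Python sketch (not part of the paper) of the circular distance |x - y|_p in Z_p and of the p-coloring validity check, as formally defined in Section 2 below:

    def circ_dist(x, y, p):
        """Distance |x - y|_p between two colors in Z_p."""
        d = abs(x - y) % p
        return min(d, p - d)

    def is_p_coloring(edges, weights, coloring, p):
        """Check |c(u) - c(v)|_p >= w(uv) on every edge uv.

        edges is an iterable of vertex pairs, weights maps each pair to
        its integer weight, coloring maps vertices to Z_p (all names
        here are illustrative, not from the paper).
        """
        return all(circ_dist(coloring[u], coloring[v], p) >= weights[(u, v)]
                   for (u, v) in edges)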
We present an algorithm for h(G)-coloring weighted series-parallel graphs. As opposed to the results in [3], the coloring algorithm presented here successfully colors series-parallel graphs with at most h(G) colors even if the ratio between the maximal and minimal edge weights exceeds 2.

2 Definitions and preprocessing

Let N denote the set of positive integers and let Z_p denote the ring of integers modulo p. If x, y ∈ Z_p, then we denote the distance between x and y in Z_p by |x - y|_p. Let G = (V, E, w) be a weighted series-parallel graph. Series-parallel graphs are constructed by first pasting triangles along edges (starting with a triangle), and then deleting edges [2]. In order to avoid computational difficulties concerning real numbers, we shall assume that the weights are integers, w : E → N.

A p-coloring of G is a mapping c : V → Z_p so that for every edge e = uv the condition |c(u) - c(v)|_p ≥ w(uv) is satisfied. Given a p-coloring c and an edge e = uv, we call |c(v) - c(u)|_p the span of e, denoted by span(e), and say that e is tight if its span equals its weight. We shall also say that p is the size of the color space Z_p.

2.1 Tree decomposition

A tree decomposition (see [2] for the theoretical background) of a series-parallel graph can be computed in linear time [1]. Given a tree decomposition (T_G, V) of G, we can, by adding edges to G (and setting their weights to 1), assume that G is an edge-maximal series-parallel graph. The parts V of the decomposition are exactly the edges and triangles of G. Two parts are adjacent (in T_G) if and only if one part is a triangle t, the other is an edge e, and e is incident with t. Hence, G is 2-connected, and given distinct edges e and f of G, there exists a cycle containing both. We shall use both G and its tree decomposition (T_G, V) for storing the graph during the course of the coloring algorithm.

Let e = v1v2 be an edge in G. If {v1, v2} is a separator in G, we say that e is a separating edge in G, and e is called nonseparating otherwise. If e = v1v2 is separating and G - {v1, v2} consists of k components C1, ..., Ck, then G_i (i = 1, ..., k) denotes the graph (in fact its representation) induced by the vertices of C_i and {v1, v2}. We call the G_i's (i = 1, ..., k) the e-splits of G. Throughout the algorithm we shall keep track of whether an edge e is a separating edge of G. This can easily be seen from T_G: namely, an edge e is nonseparating if it is adjacent to a single triangle in T_G.

Let t = e1e2e3 be a triangle (t contains the edges e1, e2, and e3) in G. Let us further assume that e1 is a separating edge, and let G0, G1, ..., Gk be all the e1-splits of G, so that G0 contains the triangle t as its subgraph. Then the (graph) union G1 ∪ ... ∪ Gk is called the (t, e1)-fragment of G and is denoted by G(t, e1). If e is nonseparating, and there exists a triangle t containing e (there may be at most one), then the (t, e)-fragment of G is the graph containing only e together with its endvertices. We call a graph trivial if it contains at most two vertices.

2.2 Heavy cycle, heavy triangle

As noted in the introduction, h(G), the Hadwiger number of G, equals either twice the weight of the heaviest edge or the largest sum of the weights of three edges lying on a common cycle. It is the latter option that is more appealing to our problem. Let t = f1f2f3 be a triangle in G. Define G1 = G(t, f1), G2 = G(t, f2), and G3 = G(t, f3). If G_i (i = 1, 2, 3) is trivial, then we say that e_i = f_i is a realizing edge of t (in G_i).
2.2 Heavy cycle, heavy triangle

As noted in the introduction, h(G), the Hadwiger number of G, equals either twice the weight of the heaviest edge or the largest sum of the weights of three edges lying on a common cycle. It is the latter option that is more appealing to our problem. Let t = f1f2f3 be a triangle in G. Define G1 = G(t, f1), G2 = G(t, f2), and G3 = G(t, f3). If Gi (i = 1, 2, 3) is trivial, then we say that ei = fi is a realizing edge of t (in Gi). If Gi is not trivial then every heaviest edge in Gi can be chosen as a realizing edge of t (in Gi). The weight of a triangle t, w(t), is defined as the sum of the weights of the edges realizing t. Clearly, the realizing edges of a triangle lie on a common cycle in G. Let e1 and e2 be distinct edges with the largest edge weights in G. A triangle t is called a heavy triangle if w(t) equals h(G), and both e1 and e2 are realizing edges of t. It may occur that no triangle in G is heavy. In this case we can, by increasing the weight of a single edge, construct a heavy triangle in G while not increasing h(G). This is the essence of the procedure heavyTriangle described in the next section. By scanning through the edges of G, we find some heaviest edge ea = uava. Next we run heavyTriangle(G, ea, ea, P) → h(G); t, fa, fb, fc; eb, ec, P, where P is initially the trivial path. Finally, we set p = h(G), c(ua) = 0, c(va) = w(ea), and run the main coloring procedure color(G, p, t; fa, fb, fc; ea, eb, ec; P) using a heavy triangle t = fafbfc with its realizing edges ea, eb, and ec as arguments.

3 Coloring algorithm

The coloring algorithm is recursive. Given a graph G with two precolored adjacent vertices ua and va, we split G along a carefully chosen edge (or edges) into several subgraphs, say G0, G1, .... Only one of these, say G0, contains both ua and va, and it is the first one to get colored. We find colorings of G1, G2, ... recursively, taking care that exactly two vertices of Gi are already colored when it is Gi's turn.

3.1 Looking for a heavy triangle

We shall first describe the routine heavyTriangle. The input for this routine consists of a weighted graph G, edges ea and eb (ea is heaviest in G, and eb is either second heaviest in G or ea = eb), and a path P ⊆ TG linking the edges ea and eb (P is trivial in case ea = eb). The routine heavyTriangle outputs, apart from the possibly new eb and P, also the Hadwiger number h(G), a heavy triangle t = fafbfc, and its third realizing edge ec. We set the notation so that ea ∈ E(G(t, fa)), eb ∈ E(G(t, fb)), and ec ∈ E(G(t, fc)), and assume that h(G) = w(ea) + w(eb) + w(ec). We use the following shorthand: heavyTriangle(G, ea, eb, P) → h(G); t, fa, fb, fc; eb, ec, P. The routine runs as follows:
(T1) If ea = eb then we find some second heaviest edge in G and adjust P so that P links ea and the newly determined eb. Hence ea ≠ eb.
(T2) For every triangle t we compute the realizing edges and its weight w(t). This can be done by tracing TG starting from ea first. Hence ea is one of the realizing edges in every triangle t. By retracing towards ea from the leaves of TG we compute the other two realizing edges of every triangle recursively. Finally we set eb to be one of the realizing edges in every triangle lying on P (in the direction from eb to ea).
(T3) Find the triangle t' with the largest possible w(t'). Set h(G) = max{w(t'), 2w(ea)}.
(T4) If h(G) > w(t'), or h(G) = w(t') and eb is not one of the realizing edges of t', then do the following: let t = fafbfc be an arbitrary triangle from P so that ea ∈ G(t, fa) and eb ∈ G(t, fb). Set ec = fc and increase the weight of ec = fc by setting w(ec) = h(G) − w(ea) − w(eb). Note that increasing the weight of ec does not increase h(G), as ea and eb are heaviest edges in G.
(T5) If h(G) = w(t') and eb is one of the realizing edges in t', then: by (T2) ea is also one of the realizing edges in t'. Set t = t'. Further, set ec to be the third realizing edge in t = fafbfc, where the notation of edges in t is chosen so that ea ∈ E(G(t, fa)), etc.
(T6) Output h(G); t, fa, fb, fc; eb, ec, P.
It is not difficult to see that heavyTriangle runs in linear time.

3.2 Recursion

We shall first describe a routine for coloring a graph with small edge weights. Let e = uv be the heaviest edge in G, and assume that p ≥ 3w(e), where p denotes the size of the color space. Let us also assume that the colors c(u) and c(v) are already determined so that span(e) is at most p − 2w(e). The procedure colorCgraph, with G, p, and e as its input (satisfying the above conditions), extends the coloring c to the remaining vertices of G. This can be done by tracing along TG starting at e, and taking care that every edge f ∈ G satisfies span(f) ≤ p − 2w(e) (as w(f) ≤ w(e)). It is easy to implement colorCgraph to run in linear time. We turn our attention to coloring the graph in case its edge weights (at least some of them) are large when compared to h(G). Let G be a weighted graph, p an upper bound for h(G), t = fafbfc a heavy triangle, and ea, eb, and ec its realizing edges (so that ea ∈ E(G(t, fa)), etc.). Let P be a path in TG joining ea and eb, and suppose that a coloring of the endvertices of ea is given so that ea is tight. Then:

Coloring Principle. With the assumptions as above, the procedure color extends the coloring c to the rest of G so that (i) apart from ea the edge eb is also tight, and (ii) span(ec) ≤ p − w(ea) − w(eb).

The call color(G, p, t; fa, fb, fc; ea, eb, ec; P) splits into three cases, and exactly one of them applies. These three cases will also serve as a recursive proof that a graph can indeed be colored according to the principle. The first case (C1) serves as the recursion basis, the last two cases (C2) and (C3) serve as recursion steps.
(C1) If G contains a single triangle t: in this case ea = fa, eb = fb, and ec = fc. Let u and v be the (colored) endvertices of ea, and let w be the common endvertex of eb and ec. There exists a unique color c(w) so that eb is tight and span(ec) = p − w(ea) − w(eb). Hence, we can extend the coloring to G according to the coloring principle. Exit.
(C2) If ea is a separating edge in G, then let G0, G1, ..., Gk be the ea-splits of G so that G0 contains eb, ec, fa, fb, fc, t, and P. We first color G0 by calling color(G0, p, t; fa, fb, fc; ea, eb, ec; P) and then take care of the other splits:
for i = 1 to k do heavyTriangle(Gi, ea, ea, Pi) → h(Gi); ti, fai, fbi, fci; ebi, eci, Pi (with each Pi initially trivial)
for i = 1 to k do color(Gi, p, ti; fai, fbi, fci; ea, ebi, eci; Pi)
Exit.
(C3) If ea is nonseparating in G, then we first increase the weights of fb and fc by setting w(fc) = w(ec) and w(fb) = w(eb). Let Ga be the graph containing G(t, ea) and the triangle t. Observe that either Ga contains at least two triangles or at least one of G(t, fb), G(t, fc) is not trivial (i.e. at least one of fb, fc is separating in G). Let Pa be the subpath of P linking fb and ea. Since w(fb) = w(eb), the edge fb is second heaviest in Ga.
If ea = fa then color(Ga, p, t; ea, fb, fc; ea, fb, fc; Pa). Note that in this case Ga contains a single triangle t, as ea is not separating in G.
Otherwise (ea ≠ fa), observe that w(fa) ≤ w(eb) = w(fb), as eb is second heaviest in G and ea ≠ fa. Hence, we increase the weight by setting w(fa) = w(fb), which makes fa second heaviest in G(t, fa). Let P' be the subpath of Pa linking ea and fa, let G' be the graph G(t, fa), and let G'' be the subgraph of G induced by the triangle t.
heavyTriangle(G', ea, fa, P') → h(G'); t', f'a, f'b, f'c; fa, e'c, P'
color(G', p, t'; f'a, f'b, f'c; ea, fa, e'c; P')
After coloring G', the edge fa is tight and we also call
color(G'', p, t; fa, fb, fc; fa, fb, fc; fa t fb).
end if
Note that at this point the endvertices of both fb and fc are colored. What is more, fb is tight, and by recursion, span(fc) ≤ p − w(ea) − w(fb) = p − w(ea) − w(eb). Finally we settle the uncolored parts.
If fb is separating in G then heavyTriangle(G(t, fb), fb, eb, ebPfb) → h(G(t, fb)); ti, fai, fbi, fci; ebi, eci, Pi and color(G(t, fb), p, ti; fai, fbi, fci; fb, ebi, eci; Pi).
If fc is separating in G then colorCgraph(G(t, fc), p, fc).
Exit.

4 Time complexity

This last section is devoted to estimating the speed of the coloring algorithm.

Time Complexity. There exists a constant C so that for every weighted series-parallel graph G of order n, the running time of the described coloring algorithm is bounded above by Cn². In other words, we can h(G)-color a weighted series-parallel graph G in quadratic time.

As already mentioned, the preprocessing takes a linear amount of time. After preprocessing, G is an edge-maximal series-parallel graph. If G contains n + 3 vertices then G contains n triangles, 2n + 1 edges, and 3n lines (edge-triangle incidences). All these quantities are equally appropriate for measuring the size of the problem. Let T(n) denote the maximal running time of the color procedure taking a graph G with n triangles as its input. We have to show that T(n) ≤ Cn², assuming T(m) ≤ Cm² for every m < n. Let D0n be an upper bound for the running times of both heavyTriangle and colorCgraph if they take a graph G containing n triangles as input. A call of color with G as its argument takes one of the three possible options: (C1), (C2), or (C3). The running time of (C1) is bounded above by a constant, say D1. If (C2) applies, let G0, G1, ..., Gk be the splits. Observe that k ≥ 1. Since the Gi's together contain exactly n triangles, the recursively called procedures heavyTriangle cumulatively take no more than D0n running time. Let (n0, n1, n2, ..., nk) be a proper (integer) partition of n, i.e. n0, n1, n2, ..., nk ≥ 1, k ≥ 1, and n0 + n1 + ··· + nk = n. Then

n0² + n1² + ··· + nk² ≤ n0² + (n1 + ··· + nk)² ≤ (n − 1)² + 1.    (1)

Now (1) implies that the cumulative running time of the recursive calls of the procedure color in (C2) is bounded from above by C(n − 1)² + C. Summing it all up, the running time of (C2) is bounded from above by C(n − 1)² + C + D0n + D2n if we use at most D2n time for running the loops (excluding the time for the recursive calls of heavyTriangle and color). The case when (C3) applies is settled similarly. Assume that the base running time of (C3) (i.e. the running time excluding the running times of the recursive calls of color, colorCgraph, and heavyTriangle) is bounded by a constant D3. Recursive color-ing and colorCgraph-ing takes at most C(n − 1)² + C, and the heavyTriangle calls take at most D0n time. Combining all three possibilities yields

T(n) ≤ max{D1, C(n − 1)² + C + D0n + D2n, C(n − 1)² + C + D0n + D3} ≤ max{D1, Cn² + (−2Cn + 2C + D0n + D2n), Cn² + (−2Cn + 2C + D0n + D3)},

which is, if C is large enough, at most Cn². This proves the assertion on time complexity.

Acknowledgment

The author's research was conducted while visiting the University of Hannover under the sponsorship of the Alexander von Humboldt Foundation. Both the hospitality of the university and the help of the foundation are gratefully acknowledged.

References

[1] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM J. Comput., 25(6):1305-1317, 1996.
[2] R. Diestel. Graph theory. Springer, New York, 1997.
[3] G. Fijavž. Hadwiger's conjecture for circular colorings of edge-weighted graphs. Discrete Math., to appear.
[4] C.
McDiarmid. Discrete mathematics and radio channel assignment. In Recent advances in algorithms and combinatorics, volume 11 of CMS Books Math./Ouvrages Math. SMC, pages 27-63. Springer, New York, 2003.
[5] B. Mohar. Circular colorings of edge-weighted graphs. J. Graph Theory, 43:107-116, 2003.
[6] B. Mohar. Hajós theorem for colorings of edge-weighted graphs. Combinatorica, 25:65-76, 2004.
[7] B. Toft. A survey of Hadwiger's conjecture. Congr. Numer., 115:249-283, 1996. Surveys in graph theory (San Francisco, CA, 1995).
[8] X. Zhu. Circular chromatic number: a survey. Discrete Math., 229(1-3):371-410, 2001. Combinatorics, graph theory, algorithms and applications.

Credit Classification Using Grammatical Evolution

Anthony Brabazon
University College Dublin, Ireland. E-mail: Anthony.Brabazon@ucd.ie
Michael O'Neill
University of Limerick, Ireland. E-mail: Michael.ONeill@ul.ie

Keywords: grammatical evolution, credit rating, bond rating
Received: May 25, 2004

Grammatical Evolution (GE) is a novel data-driven, model induction tool, inspired by the biological gene-to-protein mapping process. This study provides an introduction to GE, and demonstrates the methodology by applying it to model the corporate bond-issuer credit rating process, using information drawn from the financial statements of bond-issuing firms. Financial data and the associated Standard & Poor's issuer-credit ratings of 791 US firms, drawn from the year 1999/2000, are used to train and test the model. The best developed model was found to be able to discriminate in-sample (out-of-sample) between investment grade and junk bond ratings with an average accuracy of 87.59 (84.92)% across a five-fold cross validation.

Povzetek: Metoda gramatične evolucije je uporabljena za klasificiranje kreditov.

1 Introduction

Grammatical Evolution (GE) [1] represents an evolutionary automatic programming methodology, and can be used to evolve rule sets. These rule sets can be as general as a functional expression which produces a good mapping between a series of known input-output data vectors. A particular strength of the methodology is that the form of the model need not be specified a priori by the modeler. This is of particular utility in cases where the modeler has a theoretical or intuitive idea of the nature of the explanatory variables, but a weak understanding of the functional relationship between the explanatory and the dependent variable(s). GE does not require that the model form is linear, nor does the method require that the measure of model error used in model construction is a continuous or differentiable function. Neither is GE a black box method. As such, the evolved rules (taking the form of symbolic expressions in this instance) are amenable to human interpretation and consequently have the potential to enhance our understanding of the problem domain. A key element of the methodology is the concept of a grammar, which governs the creation of the rule sets. This paper describes the GE methodology, and applies the methodology to accurately model the corporate bond rating process. Most large firms employ both share and debt capital to provide long-term finance for their operations. The debt capital may be provided by a bank, or may be obtained by selling bonds directly to investors.
As an example of the scale of US bond markets, the value of bonds issued in the first quarter of 2003 totalled $1.70 trillion [2]. A bond can be defined as a 'debt security which constitutes a promise by the issuing firm, to pay a stated rate of interest based on the face value of the bond, and to redeem the bond at this face value at maturity.' When a publicly-traded company wants to issue traded debt (bonds), it must obtain a credit rating for the issue from at least one recognised rating agency (Standard and Poor's (S&P), Moody's or Fitch). The credit rating represents the rating agency's opinion, at a specific date, of the creditworthiness of a borrower in general (an issuer credit rating), or in respect of a specific debt issue (a bond credit rating). Therefore it serves as a surrogate measure of the risk of non-payment of interest or capital of a bond. These ratings impact on the borrowing cost and the marketability of issued bonds.

1.1 Motivation for study

There are a number of reasons to suppose a priori that the use of an evolutionary automatic programming (EAP) approach such as GE can prove fruitful in this domain. In common with the related corporate failure prediction problem [3], a feature of the bond-rating problem is that there is no clear theoretical framework for guiding the choice of explanatory variables, or model form. Rating agencies assert that their credit rating process involves consideration of both financial and non-financial information about the firm and its industry, but the precise factors utilised, and the related weighting of these factors, are not publicly disclosed by the rating agencies. In the absence of an underlying theory, most published work on credit rating prediction employs a data-inductive modelling approach, using firm-specific financial data as explanatory variables, in an attempt to recover the model used by the rating agencies. This produces a high-dimensional combinatorial problem, as the modeller is attempting to uncover a good set of model inputs, and model form, giving rise to particular potential for an evolutionary automatic programming methodology such as GE. In this initial application of GE to modelling credit rating, we restrict attention to the binary classification case (discriminating between investment grade vs junk grade ratings). This will be extended to the multi-class case in future work. It is noted that a limited number of studies have applied a grammar-based methodology to constrain the search space for classification rules [3, 4, 5, 6]. This study extends this methodology into the domain of bond-rating. The rest of this contribution is organized as follows. The next section provides an overview of the literature on bond rating, followed by a section which describes Grammatical Evolution. We then outline the data set and methodology utilised. The following sections provide the results of the study, followed by a number of conclusions.
Initial Rating    Defaults (%)
AAA               0.52
AA                1.31
A                 2.32
BBB               6.64
BB                19.52
B                 35.76
CCC               54.38

Table 1: Rate of default by initial rating category (1987-2002) (from [8]).

2 Bond rating

Several categories of individuals would be interested in a model that could produce accurate estimates of bond ratings. Such a model would be of interest to firms that are considering issuing debt, as it would enable them to estimate the likely return investors would require if the debt was issued, thereby providing information for the pricing of the bonds. The model could also be used to assess the creditworthiness of firms that have not issued debt and hence do not already have a published bond rating. This information would be useful to bankers or other companies that are considering whether they should extend credit to that firm. Much rated debt is publicly traded on stock markets, and bond ratings are typically changed infrequently. An accurate bond-rating prediction model could indicate whether the current rating of a bond is still justified. To the extent that an individual investor could predict a bond rerating before other investors foresee it, this may provide a trading edge. In addition, the recent introduction of credit-risk derivatives allows investors to buy protection against the risk of the downgrade of a bond [7]. The pricing of such derivative products requires a quality model for estimating the likelihood of a credit rating change.

2.1 Bond Rating Notation

Although the precise notation used by individual rating agencies to denote the creditworthiness of a bond or issuer varies, in each case the rating is primarily denoted by a discrete, mutually exclusive, 'letter grade'. Taking the rating structure of S&P as an example, the ratings are broken down into 10 broad classes. The highest rating is denoted AAA, and the ratings then decrease in the following order: AA, A, BBB, BB, B, CCC, CC, C, D. Ratings between AAA and BBB (inclusive) are deemed to represent investment grade, with lower quality ratings deemed to represent debt issues with significant speculative characteristics (junk bonds). A 'C' grade represents a case where a bankruptcy petition has been filed, and a 'D' rating represents a case where the borrower is currently in default on their financial obligations. As would be expected, the probability of default depends strongly on the initial rating which a bond receives (see table 1). Ratings from AAA to CCC can be modified by the addition of a + or a -, to indicate at which end of the rating category the bond rating falls. An initial rating is prepared when a bond is being issued, and this rating is periodically reviewed thereafter by the rating agency. Bonds (or issuers) may be re-rated upwards (upgrade) or downwards (downgrade) if firm or environmental circumstances change. A re-rating of a bond below investment grade to junk bond status (such bonds are colorfully termed 'a fallen angel') may trigger a significant sell-off, as many institutional investors are only allowed, by external or self-imposed regulation, to hold bonds of investment grade. The practical effect of a bond (or issuer) being assigned a lower rather than a higher rating is that its perceived riskiness in the eyes of potential investors increases, and consequently the required interest yield of the bond rises.

2.2 Prior Literature

In essence, the objective of constructing a model of bond ratings is to produce a model of rating agency behaviour, using publicly available information. A large literature exists on bond-rating prediction. The earliest attempts utilised statistical methodologies such as linear regression (OLS) [9], multiple discriminant analysis [10], the multinomial logit model [11], and ordered-probit analysis [12]. The results from these studies varied, and typically results of about 50-60% prediction accuracy (out-of-sample) were obtained, using financial data as inputs.
With the advent of artificial intelligence and machine learning, the range of techniques applied to predict bond ratings has expanded to include neural networks [13]. In the case of prior neural network research, the predictive accuracy of the developed models has varied. Several studies employed a binary predictive target and reported good classification accuracies. For example, [14] used a neural network to predict AA or non-AA bond ratings, and obtained an accuracy of approximately 83.3%. However, a small sample size (47 companies) was adopted in the study, making it difficult to generalise strongly from its results.

3 Grammatical evolution

Evolutionary algorithms (EAs) operate on principles of evolution, usually being coarsely modelled on the theories of survival of the fittest and natural selection [15]. In general, evolutionary algorithms can be characterized as:

x[t + 1] = r(v(s(x[t])))    (1)

where x[t] is the population of solutions at iteration t, v(.) is the random variation operator (crossover and mutation), s(.) is the selection for reproduction operator, and r is the replacement operator which determines which of the parents and children survive into the next generation. Therefore the algorithm turns one population of candidate solutions into another, using selection, crossover and mutation. Selection exploits information in the current population, concentrating interest on 'high-fitness' solutions. Crossover and mutation perturb these solutions in an attempt to uncover better solutions, and these operators can be considered as general heuristics for exploration. GE is a grammatical approach to Genetic Programming (GP) that can evolve computer programs (or rule sets) in any language, and a full description of GE can be found in [1, 16, 17, 18]. Rather than representing the programs as syntax trees, as in Koza's GP [19], a linear genome representation is used. Each individual, a variable-length binary string, contains in its codons (groups of 8 bits) the information to select production rules from a Backus Naur Form (BNF) grammar. In other words, an individual's binary string contains the instructions that direct a developmental process resulting in the creation of a program or rule. As such, GE adopts a biologically-inspired, genotype-phenotype mapping process. At present, the search element of the system is carried out by an evolutionary algorithm, although other search strategies with the ability to operate over binary or integer strings have also been used [1, 5]. The GE system possesses a modular structure (see figure 1) which will allow future advances in the field of evolutionary algorithms to be easily incorporated.

3.1 The Biological Approach

The GE system is inspired by the biological process of generating a protein from the genetic material of an organism. Proteins are fundamental in the proper development and operation of living organisms and are responsible for traits such as eye color and height [20]. The genetic material (usually DNA) contains the information required to produce specific proteins at different points along the molecule. For simplicity, consider DNA to be a string of building blocks called nucleotides, of which there are four, named A, T, G, and C, for adenine, thymine, guanine, and cytosine respectively. Groups of three nucleotides, called codons, are used to specify the building blocks of proteins.
These protein building blocks are known as amino acids, and the sequence of these amino acids in a protein is determined by the sequence of codons on the DNA strand. The sequence of amino acids is very important as it determines the final three-dimensional structure of the protein, which in turn has a role to play in determining its functional properties. In order to generate a protein from the sequence of nucleotides in the DNA, the nucleotide sequence is first transcribed into a slightly different format, that being a sequence of elements on a molecule known as mRNA. Codons within the mRNA molecule are then translated to determine the sequence of amino acids that are contained within the protein molecule. The application of production rules to the non-terminals of the incomplete code being mapped in GE is analogous to the role amino acids play when being combined together to transform the growing protein molecule into its final functional three-dimensional form. The result of the expression of the genetic material as proteins, in conjunction with environmental factors, is the phenotype. In GE, the phenotype is a sentence or sentences in the language defined by the input grammar. These sentences can take the form, for example, of functions, programs, or, as in the case of this study, rule sets. The phenotype is generated from the genetic material (the genotype) by a process termed a genotype-phenotype mapping. This is unlike the standard method of generating a solution directly from an individual in an evolutionary algorithm by explicitly encoding the solution within the genetic material. Instead, a many-to-one mapping process is employed, within which the robustness of the GE system lies. Figure 2 compares the mapping process employed in both GE and biological organisms.

Figure 1: Modular structure of grammatical evolution.

Figure 2: A comparison between the grammatical evolution system and a biological genetic system. The binary string of GE is analogous to the double helix of DNA, each guiding the formation of the phenotype. In the case of GE, this occurs via the application of production rules to generate the terminals of the compilable program. In the biological case, by directing the formation of the phenotypic protein by determining the order and type of protein subcomponents (amino acids) that are joined together.

3.2 The Mapping Process

When tackling a problem with GE, a suitable BNF (Backus Naur Form) grammar definition must first be defined. The BNF can be either the specification of an entire language or, perhaps more usefully, a subset of a language geared towards the problem at hand. In GE, a BNF definition is used to describe the output language to be produced by the system. BNF is a notation for expressing the grammar of a language in the form of production rules. BNF grammars consist of terminals, which are items that can appear in the language, e.g. binary operators +, -, unary operators Sin, constants 1.0, etc., and non-terminals, which can be expanded into one or more terminals and non-terminals.
A grammar can be represented by the tuple {N, T, P, S}, where N is the set of non-terminals, T the set of terminals, P a set of production rules that maps the elements of N to T, and S is a start symbol which is a member of N. When there are a number of productions that can be applied to one element of N, the choice is delimited with the '|' symbol. For example:

N = {<expr>, <op>, <pre_op>, <var>}
T = {Sin, +, -, /, *, X, 1.0, (, )}
S = <expr>

and P can be represented as:

(A) <expr> ::= <expr><op><expr> (0)
             | (<expr><op><expr>) (1)
             | <pre_op>(<expr>) (2)
             | <var> (3)
(B) <op> ::= + (0) | - (1) | / (2) | * (3)
(C) <pre_op> ::= Sin
(D) <var> ::= X (0) | 1.0 (1)

From this grammar, <expr> can be transformed into one of four rules: it becomes <expr><op><expr>, (<expr><op><expr>) (which is the same as the first, but surrounded by brackets), <pre_op>(<expr>), or <var>. The program, or sentence(s), produced will consist of elements of the terminal set T. The grammar is used in a developmental approach whereby the evolutionary process evolves the production rules to be applied at each stage of a mapping process, starting from the start symbol, until a complete program is formed. A complete program is one that is comprised solely from elements of T. As the BNF definition is a plug-in component of the system, GE can produce code in any language, thereby giving the system a unique flexibility. For the above BNF, table 2 summarizes the production rules and the number of choices associated with each.

Rule no.    Choices
A           4
B           4
C           1
D           2

Table 2: The number of choices available from each production rule.

The genotype is used to map the start symbol onto terminals by reading codons of 8 bits to generate a corresponding integer value, from which an appropriate production rule is selected by using the following mapping function:

Rule = Codon Value % No. Rule Choices    (2)

where % is the MOD function, which returns the remainder after a division operation (e.g. 4 % 3 = 1). Consider production rule (B) above for the non-terminal <op>; there are four possible choices for this non-terminal. If we assume the codon being read produces the integer 6, then 6 % 4 = 2 would select choice (2), the operator /. Each time a production rule has to be selected to transform a non-terminal, another codon is read. In this way the system traverses the genome. During the genotype-to-phenotype mapping process, it is possible for individuals to run out of codons, and in this case we wrap the individual and reuse the codons. This is quite an unusual approach in EAs, as it is entirely possible for certain codons to be used two or more times. This technique of wrapping the individual draws inspiration from the gene-overlapping phenomenon that has been observed in many organisms [20]. In GE, each time the same codon is expressed it will always generate the same integer value, but, depending on the current non-terminal to which it is being applied, it may result in the selection of a different production rule. This feature is referred to as intrinsic polymorphism. Crucially, however, each time a particular individual is mapped from its genotype to its phenotype, the same output is generated. This is the case because the same choices are made each time. However, it is possible that an incomplete mapping could occur, even after several wrapping events, and in this case the individual in question is given the lowest possible fitness value. The selection and replacement mechanisms then operate accordingly to increase the likelihood that this individual is removed from the population. An incomplete mapping could arise if the integer values expressed by the genotype were applying the same production rules repeatedly.
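As an illustration of this mapping, the following Python sketch applies equation (2), with wrapping, to the example grammar above; the function and variable names, and the wrapping limit of 10, are our own choices rather than part of the GE system.

GRAMMAR = {
    "<expr>": ["<expr><op><expr>", "(<expr><op><expr>)", "<pre_op>(<expr>)", "<var>"],
    "<op>": ["+", "-", "/", "*"],
    "<pre_op>": ["Sin"],
    "<var>": ["X", "1.0"],
}

def map_genotype(codons, start="<expr>", max_wraps=10):
    # codons: list of 8-bit codon values; returns a sentence, or None if invalid
    sentence, i, wraps = start, 0, 0
    while True:
        nonterminals = [nt for nt in GRAMMAR if nt in sentence]
        if not nonterminals:
            return sentence                         # complete program generated
        nt = min(nonterminals, key=sentence.index)  # leftmost non-terminal
        if i == len(codons):                        # end of genome: wrap around
            i, wraps = 0, wraps + 1
            if wraps > max_wraps:
                return None                         # incomplete mapping: invalid
        choices = GRAMMAR[nt]
        rule = codons[i] % len(choices)             # Rule = codon % no. of choices
        sentence = sentence.replace(nt, choices[rule], 1)
        i += 1

print(map_genotype([3, 0]))   # <expr> -> <var> -> X
print(map_genotype([0]))      # always selects rule 0: grows forever, returns None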
For example, consider an individual with three codons, all of which specify rule 0 from production rule (A):

<expr> ::= <expr><op><expr> (0)
         | (<expr><op><expr>) (1)
         | <pre_op>(<expr>) (2)
         | <var> (3)

Even after wrapping, the mapping process would be incomplete and would carry on indefinitely unless stopped. This occurs because the non-terminal <expr> is being mapped recursively by production rule 0, so it becomes <expr><op><expr>. Therefore, the leftmost <expr> after each application of a production would itself be mapped to <expr><op><expr>, resulting in an expression continually growing as follows: <expr><op><expr>, then <expr><op><expr><op><expr>, and so on. Such an individual is dubbed invalid, as it will never undergo a complete mapping to a set of terminals. For this reason we impose an upper limit on the number of wrapping events that can occur. It is clearly essential that stop sequences are found during the evolutionary search in order to complete the mapping process to a functional program, a stop sequence being a set of codons that result in the non-terminals being transformed into elements of the grammar's terminal set. Beginning from the left-hand side of the genome, then, codon integer values are generated and used to select rules from the BNF grammar, until one of the following situations arises:

1. A complete program is generated. This occurs when all the non-terminals in the expression being mapped are transformed into elements from the terminal set of the BNF grammar.

2. The end of the genome is reached, in which case the wrapping operator is invoked. This results in the return of the genome reading frame to the left-hand side of the genome once again. The reading of codons will then continue, unless an upper threshold representing the maximum number of wrapping events has occurred during this individual's mapping process.

3. In the event that a threshold on the number of wrapping events has occurred and the individual is still incompletely mapped, the mapping process is halted, and the individual is assigned the lowest possible fitness value.

To reduce the number of invalid individuals being passed from generation to generation, a steady-state replacement mechanism is employed. One consequence of the use of a steady-state method is its tendency to maintain fit individuals at the expense of less fit, and in particular, invalid individuals.

4 Experimental approach

The dataset consists of financial data of 791 non-financial US companies drawn from the S&P Compustat database. The associated S&P overall credit rating for each corporate bond issuer is also obtained from the database.¹ Of these companies, 57% have an investment rating (AAA, AA, A, or BBB), and 43% have a junk rating. To allow time for the preparation of year-end financial statements, the filing of these statements with the Securities and Exchange Commission (S.E.C.), and the development of a bond rating opinion by the Standard and Poor's rating agency, the bond rating of the company as at 30 April 2000 is matched with financial information drawn from the financial statements as at 31 December 1999. A subset of 600 firms was randomly sampled from the total of 791 firms, to produce two groups of 300 'investment' grade and 300 junk rated firms. The 600 firms were randomly allocated to the training set (420) or the hold-out sample (180), ensuring that each set was equally balanced between investment and non-investment grade ratings. A total of eight financial variables was selected for inclusion in this study.

¹ S&P is one of the largest credit rating agencies in the world, currently rating about 150,000 issues of securities across 50 countries. It provides credit ratings for about 99.2% of the debt obligations and preferred stock issues which are publicly traded in the US [8].
The selection of these variables was guided both by prior literature in bankruptcy prediction [21, 22, 23] and by the literature on bond rating prediction [14, 24, 25], resulting in an initial judgemental selection of a subset of accounting ratios. These ratios were then further filtered using statistical analysis. Five groupings of explanatory variables, drawn from financial statements, are given prominence in prior literature as being the prime determinants of bond issue quality and default risk:

i. Liquidity
ii. Debt
iii. Profitability
iv. Activity / Efficiency
v. Size

Liquidity refers to the availability of cash resources to meet short-term cash requirements. Debt measures focus on the relative mix of funding provided by shareholders and lenders. Profitability considers the rate of return generated by a firm, in relation to its size, as measured by sales revenue and/or asset base. Activity measures consider the operational efficiency of the firm in collecting cash, managing stocks and controlling its production or service process. Firm size provides information on both the sales revenue and asset scale of the firm and also provides a proxy metric on firm history. The groupings of potential explanatory variables can be represented by a wide range of individual financial ratios, each with slightly differing information content. The groupings themselves are interconnected, as weak (or strong) financial performance in one area will impact on another. For example, a firm with a high level of debt may have lower profitability due to high interest costs. Following the examination of a series of financial ratios under each of these headings, the following inputs were selected:

i. Current ratio
ii. Retained earnings to total assets
iii. Interest coverage
iv. Debt ratio
v. Net margin
vi. Market to book value
vii. Log (Total assets)
viii. Return on total assets

The objective in selecting a set of proto-explanatory variables is to choose financial variables that vary between companies in different bond rating classes, and where information overlaps between the variables are minimised. Comparing the means of the above ratios for the two groups of ratings (see table 3) reveals a statistically significant difference between the two groups at both the 5% and the 1% level, and, as expected, the financial ratios in each case for the investment ratings are stronger than those for the junk ratings. The only exception is the current ratio, which is stronger for the junk rated companies, possibly indicating a preference for these companies to hoard short-term liquidity, as their access to long-term capital markets is weak. A correlation analysis between the selected ratios (see table 4) indicates that most of the cross-correlations are less than |0.20|, with the exception of the debt ratio and (Retained Earnings/Total Assets) ratio pairing, which has a correlation of -0.64. In this study, the GE algorithm uses a steady-state replacement mechanism, such that two parents produce two children, the best of which replaces the worst individual in the current population if the child has greater fitness. The standard genetic operators of bit mutation (probability of 0.01) and variable-length one-point crossover (probability of 0.9) are adopted.
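As a concrete reading of these two operators, the sketch below shows variable-length one-point crossover (each parent is cut at its own point, so offspring lengths may differ from the parents') and per-bit mutation; this is our own minimal illustration, not the authors' implementation.

import random

def one_point_crossover(a, b, p_cross=0.9):
    # variable-length one-point crossover over two binary-string parents
    if random.random() >= p_cross:
        return a[:], b[:]
    i = random.randrange(1, len(a))
    j = random.randrange(1, len(b))
    return a[:i] + b[j:], b[:j] + a[i:]

def bit_mutation(bits, p_mut=0.01):
    # flip each bit independently with probability p_mut
    return [bit ^ 1 if random.random() < p_mut else bit for bit in bits]

parent1 = [random.randint(0, 1) for _ in range(64)]
parent2 = [random.randint(0, 1) for _ in range(80)]
child1, child2 = one_point_crossover(parent1, parent2)
child1 = bit_mutation(child1)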
A series of functions are pre-defined, as are a series of mathematical operators. A population of initial rule sets (programs) is randomly generated, and by means of an evolutionary process these are improved. No explicit model specification is assumed ex-ante, although the choice of mathematical operators defined in the grammar does place implicit limitations on the model specifications amongst which GE can search. The grammar adopted in this study is as follows:

<rule> ::= if( <expr> <relop> <const> ) class=''Junk''; else class=''Investment Grade'';
<expr> ::= ( <expr> ) + ( <expr> ) | <coeff> * <var>
<var> ::= Current_Ratio | Retained_Earnings_to_total_assets | Interest_Coverage | Debt_Ratio | Net_Margin | Market_to_book_value | Total_Assets | ln(Total_Assets) | Return_on_total_assets
<coeff> ::= ( <coeff> ) <op> ( <coeff> ) | <const>
<op> ::= + | - | *
<const> ::= 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | -1 | .1
<relop> ::= <=

5 Results

The results from our experiments are now provided. Each of the GE experiments is run for 100 generations, with variable-length one-point crossover at a probability of 0.9, bit mutation at a probability of 0.01, roulette selection, and steady-state replacement. Results are reported for two population sizes (500 and 1000). To assess the stability of the results across different randomisations of the dataset between training and test data, we recut the dataset five times, maintaining an equal balance of investment and non-investment grade ratings in the resulting training and test datasets.

                          Investment grade    Junk bond
Current ratio             1.354               1.93
Ret. earn/Tot assets      0.22                -0.12
Interest coverage         7.08                1.21
Debt ratio                0.32                0.53
Net margin                0.07                -0.44
Market to book value      18.52               4.02
Total assets              10083               1876
Return on total assets    0.10                0.04

Table 3: Means of input ratios for investment and junk bond groups of companies.

        CR      RE/TA   IC      DR      NM      MTB     TA      ROA
CR      1       -0.08   -0.01   0.06    -0.27   0.01    -0.18   -0.15
RE/TA   -0.08   1       0.27    -0.64   0.14    0.15    0.15    0.48
IC      -0.01   0.27    1       -0.28   0.06    0.31    0.15    0.41
DR      0.06    -0.64   -0.28   1       -0.05   -0.19   -0.20   -0.27
NM      -0.27   0.14    0.06    -0.05   1       0.01    0.03    0.22
MTB     0.01    0.15    0.31    -0.19   0.01    1       0.04    0.14
TA      -0.18   0.15    0.15    -0.20   0.03    0.04    1       0.07
ROA     -0.15   0.48    0.41    -0.27   0.22    0.14    0.07    1

Table 4: Correlations between financial ratios.

In our experiments, fitness is defined as the number of correct classifications obtained by an evolved discriminant rule. The results for the best individual of each cut of the dataset, where 30 independent runs were performed for each cut, averaged over all five randomisations of the dataset, for both the 500 and 1000 population sizes, are given in table 5. In each case the overall classification accuracy is provided, and this is then subdivided into the number of true positives (Ntp), the number of true negatives (Ntn), and the number of false positives and false negatives respectively (Nfp, Nfn). To assess the overall hit-ratio of the developed models (out-of-sample), Press's Q statistic [26] was calculated for each model. In all cases, the null hypothesis, that the out-of-sample classification accuracies are not significantly better than those that could occur by chance alone, was rejected at the 1% level. A t-test of the hit-ratios also rejected a null hypothesis that the classification accuracies were no better than chance at the 1% level. Across all the data recuts, the best individual achieved an 87.56 (84.36)% accuracy in-sample (out-of-sample) when the population size was 500, with the best individual across all data recuts in the population=1000 case obtaining an accuracy of 87.59 (84.92)% in-sample (out-of-sample).
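Since fitness is simply the count of correct classifications, scoring a candidate rule is straightforward. The sketch below evaluates a linear discriminant rule of the shape produced by the grammar above; the coefficient and variable names and the toy data values are hypothetical (the coefficients echo the population=500 rule quoted below, with suitably scaled inputs).

def classify(row, coeffs):
    # row: dict of ratio values; coeffs: ratio-name -> coefficient, plus "const"
    score = coeffs.get("const", 0.0) + sum(
        c * row[name] for name, c in coeffs.items() if name != "const")
    return "Junk" if score > 0 else "Investment Grade"

def fitness(dataset, coeffs):
    # fitness = number of correctly classified (row, label) pairs
    return sum(classify(row, coeffs) == label for row, label in dataset)

rule = {"const": 10, "Debt_Ratio": 16, "RE_to_TA": -9, "Total_Assets": -2}
data = [({"Debt_Ratio": 0.53, "RE_to_TA": -0.12, "Total_Assets": 1.9}, "Junk"),
        ({"Debt_Ratio": 0.32, "RE_to_TA": 0.22, "Total_Assets": 10.1}, "Investment Grade")]
print(fitness(data, rule))   # 2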
Although the average out-of-sample accuracy obtained for population=1000 slightly exceeds that for population=500, the difference was not found to be statistically significant. A plot of the best and average fitness on each cut of the in-sample dataset, for the population=500 case, can be seen in figure 3, and for the case where population=1000 in figure 4. Examining the structure of the best individual in the case where the initial fitness function was utilised and where population=500 shows that the evolved discriminant function had the following form:

IF (10 + 16 var6 - 9 var4 - 2 var9) > 0 THEN 'Junk' ELSE 'Investment Grade'

where var6 is Debt Ratio, var4 is Retained Earnings to Total Assets, and var9 is Total Assets. In the case where population=1000 the best evolved discriminant function had a similar form to the above:

IF (5 + 8 var6 - 4 var4 - var9) > 0 THEN 'Junk' ELSE 'Investment Grade'

Examining the signs of the coefficients of the evolved rules does not suggest that they conflict with common financial intuition. The rules indicate that low/negative retained earnings, low/negative total assets or high levels of debt finance are symptomatic of a firm that has a junk rating. It is noted that similar risk factors have been identified in predictive models of corporate failure which utilise financial ratios as explanatory inputs [3, 4]. Conversely, low levels of debt, a history of successful profitable trading, and high levels of total assets are symptomatic of firms that have an investment grade rating. Although the two discriminant functions have differing coefficient values, they are in essence very similar, as the differing coefficient values are balanced by the differing constant term which has been evolved in each function. Considering the individual classification rules, it is interesting that, despite the potential to generate long, complex ratio chains, this bloating did not occur and the evolved classifiers are reasonably concise in form. We also note that the evolved classifiers (unlike those created by means of a neural network methodology, for example) are amenable to human interpretation.

                          Fitness    TP       TN       FP      FN
Train GEBOND500           0.861      185.4    176.4    33.6    24.6
Train GEBOND1000          0.867      183.4    180.8    29.2    26.6
Out-Sample GEBOND500      0.854      77.8     76       14      12.2
Out-Sample GEBOND1000     0.860      78       76.8     13.2    12
Train MLP                 0.8690     181.8    183.2    26.8    28.2
Out-sample MLP            0.8500     75.8     77.2     12.8    14.2

Table 5: Performance of the best evolved rules on their training and out-of-sample datasets, averaged over all five randomisations, compared with the classification performance of an MLP on the same datasets.

Figure 3: Best and average fitness values of 30 runs on the five recuts of the in-sample dataset with a population size of 500.

Figure 4: Best and average fitness values of 30 runs on the five recuts of the in-sample dataset with a population size of 1000.

5.1 Benchmarking the Results

To provide a benchmark for the results obtained by GE, we compare them with the results obtained on the same recuts of the dataset, using a fully-connected, three-layer, feedforward multi-layer perceptron (MLP) trained using the back-propagation algorithm, and with the results obtained using linear discriminant analysis.
The developed MLP networks utilised all the explanatory variables. The optimal number of hidden-layer nodes was found following experimentation on each separate data recut, and varied between two and four nodes. The classification accuracies for the networks, averaged over all five recuts, are provided in table 5. The levels of classification accuracy obtained with the MLP are competitive with earlier research, with for example [14] obtaining an out-of-sample classification accuracy of approximately 83.3%, although it is noted that the size of the dataset in that study was small. Comparing the results from the MLP with those of GE on the initial fitness function suggests that GE has proven highly competitive with an MLP methodology, producing a similar classification accuracy on the training data, and slightly out-performing the MLP out-of-sample. Utilising the same dataset recuts as GE, LDA produced results (averaged across all five recuts) of 82.74% in-sample, and 85.22% out-of-sample. Again, GE is competitive against these results in terms of classification accuracy. Comparing the results obtained by the linear classifiers (LDA and GE) against those of an MLP suggests that strong non-linearities between the explanatory variables and the dependent variable are not present.

6 Conclusions & future work

In this paper a novel methodology, GE, was described and applied for the purposes of prediction of bond ratings. It is noted that this novel methodology has general utility for rule-induction applications. GE was found to be able to evolve quality classifiers for bond ratings from raw financial information. Despite using data drawn from companies in a variety of industrial sectors, the developed models showed an impressive capability to discriminate between investment and junk rating classifications. The GE-developed models also proved highly competitive with a series of MLP models developed on the same datasets. Several extensions of the methodology in this study are indicated for future work. One route is the inclusion of non-financial company and industry-level information as input variables. A related possibility would be to concentrate on building rating models for individual industrial sectors. The study can also be extended to encompass the multi-class rating prediction problem. As already noted, there are multiple methodologies available for the generation of classification rules / regression models [27, 28]. Future work could extend this study by examining the general utility of GE vs other methods of generating classification rules, by comparing the performance of a range of methods on a wider range of datasets.

References

[1] O'Neill, M. and Ryan, C. (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language. Kluwer Academic Publishers.
[2] Bond Market Statistics (2003). New York: The Bond Market Association.
[3] Brabazon, A. and O'Neill, M. (2004). Diagnosing Corporate Stability using Grammatical Evolution, International Journal of Applied Mathematics and Computer Science, 14(3), pp. 363-374.
[4] Brabazon, A. and O'Neill, M. (2003). Anticipating Bankruptcy Reorganisation from Raw Financial Data using Grammatical Evolution, Proceedings of EvoIASP 2003, Lecture Notes in Computer Science (2611): Applications of Evolutionary Computing, edited by Raidl, G., Meyer, J.A., Middendorf, M., Cagnoni, S., Cardalda, J. J. R., Corne, D. W., Gottlieb, J., Guillot, A., Hart, E., Johnson, C. G., Marchiori, E., pp.
368-378, Berlin: Springer-Verlag.
[5] O'Neill, M. and Brabazon, A. (2004). Grammatical Swarm, in Proceedings of the Genetic and Evolutionary Computation Conference GECCO 2004, Lecture Notes in Computer Science (3102), Deb et al. (eds.), Seattle, USA, June 26-30, 2004, 1, pp. 163-174, Berlin: Springer-Verlag.
[6] Shan, Y., McKay, R., Baxter, R., Abbass, H., Essam, D. and Nguyen, H. (2003). Grammar Model Based Program Evolution, in Proceedings of the 2004 IEEE Congress on Evolutionary Computation, 1, pp. 478-485, IEEE Press: New Jersey.
[7] Altman, E. (1998). The importance and subtlety of credit rating migration, Journal of Banking & Finance, 22, pp. 1231-1247.
[8] Standard and Poor's (2002). Standard and Poor's Rating Services, Statement at US SEC Public Hearing on the Role and Function of Credit Rating Agencies in the US Securities Markets, 15 November 2002.
[9] Horrigan, J. (1966). The determination of long term credit standing with financial ratios, Journal of Accounting Research, (supplement 1966), pp. 44-62.
[10] Pinches, G. and Mingo, K. (1973). A multivariate analysis of industrial bond ratings, Journal of Finance, 28(1), pp. 1-18.
[11] Ederington, H. (1985). Classification models and bond ratings, Financial Review, 20(4), pp. 237-262.
[12] Gentry, J., Whitford, D. and Newbold, P. (1988). Predicting industrial bond ratings with a probit model and funds flow components, Financial Review, 23(3), pp. 269-286.
[13] Maher, J. and Sen, T. (1997). Predicting bond ratings using neural networks: a comparison with logistic regression, Intelligent Systems in Accounting, Finance and Management, 6, pp. 23-40.
[14] Dutta, S. and Shekhar, S. (1988). Bond rating: a nonconservative application of neural networks, Proceedings of IEEE International Conference on Neural Networks, II, pp. 443-450.
[15] Fogel, D. (2000). Evolutionary Computation: Towards a new philosophy of machine intelligence, New York: IEEE Press.
[16] O'Neill, M. (2001). Automatic Programming in an Arbitrary Language: Evolving Programs in Grammatical Evolution. PhD thesis, University of Limerick.
[17] O'Neill, M. and Ryan, C. (2001). Grammatical Evolution, IEEE Transactions on Evolutionary Computation, 2001.
[18] Ryan, C., Collins, J.J. and O'Neill, M. (1998). Grammatical Evolution: Evolving Programs for an Arbitrary Language. Lecture Notes in Computer Science 1391, Proceedings of the First European Workshop on Genetic Programming, pp. 83-95, Springer-Verlag.
[19] Koza, J. (1992). Genetic Programming. MIT Press.
[20] Lewin, B. (2000). Genes VII. Oxford University Press.
[21] Altman, E. (1993). Corporate Financial Distress and Bankruptcy, New York: John Wiley and Sons Inc.
[22] Morris, R. (1997). Early Warning Indicators of Corporate Failure: A critical review of previous research and further empirical evidence, London: Ashgate Publishing Limited.
[23] Altman, E. (1968). Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy, Journal of Finance, 23, pp. 589-609.
[24] Kamstra, M., Kennedy, P. and Suan, T.K. (2001). Combining Bond Rating Forecasts Using Logit, The Financial Review, 37, pp. 75-96.
[25] Singleton, J. and Surkan, A. (1991). Modeling the Judgment of Bond Rating Agencies: Artificial Intelligence Applied to Finance, Journal of the Midwest Finance Association, 20, pp. 72-80.
[26] Hair, J., Anderson, R., Tatham, R. and Black, W. (1998). Multivariate Data Analysis, Upper Saddle River, New Jersey: Prentice Hall.
[27] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984).
Classification and Regression Trees, New York: Chapman and Hall.
[28] Torgo, L. (2000). Partial Linear Trees, in Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Langley, P. (ed), pp. 1007-1014, Morgan Kaufmann.

Integration of Bargaining into E-Business Systems

Heinrich C. Mayr
University of Klagenfurt, Department of Business Informatics and Application Systems, Universitätsstr. 65-67, 9020 Klagenfurt, Austria
E-mail: mayr@ifit.uni-klu.ac.at, http://www.ifi.uni-klu.ac.at/IWAS
Klaus-Dieter Schewe
Massey University, Department of Information Systems & Information Science Research Centre, Private Bag 11 222, Palmerston North, New Zealand
E-mail: k.d.schewe@massey.ac.nz, http://isrc.massey.ac.nz
Bernhard Thalheim
Christian Albrechts University Kiel, Department of Computer Science and Applied Mathematics, Olshausenstr. 40, 24098 Kiel, Germany
E-mail: thalheim@is.informatik.uni-kiel.de, http://www.is.informatik.uni-kiel.de
Tatjana Welzer
University of Maribor, Institute of Informatics, Smetanova ul. 17, 2000 Maribor, Slovenia
E-mail: welzer@uni-mb.si, http://lisa.uni-mb.si

Keywords: bargaining, co-design, e-business, games, web information systems
Received: October 24, 2006

Despite the fact that bargaining plays an important role in business communications, it is largely neglected in e-business systems. In this paper a conceptual model that integrates bargaining into web-based e-business systems will be developed, starting from an informal characterisation of the bargaining process. Bargaining can be formalised as a two-player game, and integrated with the co-design approach for the design of web information systems. In this way bargaining games are played on parameterised story spaces, such that each move of a player adds constraints to the parameters. Each player follows a strategy for making moves, and winning strategies are characterised by highly-ranked agreements.

Povzetek: Opisano je uvajanje pogajanja v sistem e-poslovanja.

1 Introduction

Bargaining plays an important role in business communications. For instance, in commerce it is common to bargain about prices, discounts, etc., and in banking and insurance bargaining about terms and conditions applies. E-business aims at supporting business with electronic media, in particular web-based systems. These systems support, complement or even replace human labour that would normally be involved in the process. In [10] it has been outlined that such systems can only be developed successfully if the human communication behaviour is well understood, so that it can become part of an electronic system. Bargaining is part of that communication behaviour. However, bargaining is largely neglected in e-business.
In business-oriented literature, e.g. [6, 13], secure payments and trust are mentioned, but negotiation latitude or bargaining do not appear. Looking at the discussion of technology for e-business this comes as no surprise, as the emphasis is on the sequencing of user actions and the data support, but almost never on inferences. For instance, favourable topics in e-business modelling are business processes [1], workflow [8], e-payment [2], trust [4], decision support [3], or web services [12].

In this paper we make an attempt to integrate bargaining into web-based e-business systems using the co-design approach [11] to the design of web information systems (WISs). We start with a characterisation of the bargaining process as an interaction between at least two parties. The cornerstones of this characterisation are goals, acceptable outcomes, strategies, secrets, trust and distrust, and preferences. We believe that before dropping into formal details of a conceptual model for bargaining, we first need a clearer picture of what we are aiming at. We will discuss the characteristics of bargaining in Section 2, following previous work in [9]. We will also outline the differences to auction systems.

In Section 3 we briefly present parts of the co-design approach to WIS design [11] in order to have a simple conceptual model of WISs, into which ideas concerning bargaining can be implanted. We emphasise the idea of a story space as a collection of abstract locations (called scenes) and transitions between them that are initiated by actions, the support of the scenes by database views, and the support of the actions by operations associated with the views. Though many aspects of the co-design approach will be omitted in this model, it will suffice to serve as a basis for a formalisation of bargaining. In Section 4 we develop a model for bargaining based on games that are played on the conceptual model. This idea was already presented in [9], though only in a completely informal way. We concentrate on bargaining involving only two parties. Their "playground" will be the parameterised story space, and moves consist of adding constraints to the parameters. The moves of the players reflect offers, counteroffers, acceptance and denial. Both players aim at an optimal outcome for themselves, but success is defined as outcomes that are acceptable for both parties. Furthermore, players follow bargaining strategies that may lead them to a final agreement. We will characterise such strategies and attempt to define what a "winning strategy" might be, though obviously bargaining games do not end with one party winning and the other one losing. Furthermore, we characterise the context of bargaining as being defined by user profiles including preferences and desires, and bargaining preferences. In e-business systems the role of one player will be taken by a user, while the system plays the other role. This may be extended to a multiple-player game with more than one single human player, e.g. if bargaining becomes too critical to leave it exclusively to a system.

2 Characteristics of the Bargaining Process

Let us start by looking at human bargaining processes. We consider two typical bargaining situations, in a commerce application and a loan application. From these examples we derive characteristic features of bargaining.

2.1 Examples of Bargaining

In a typical commerce situation a customer may enter into bargaining over the total price of an order consisting of several goods, each with its particular quantity. The seller might have indicated a price, but as the order will lead to substantial turnover, he is willing to enter into bargaining. The goal of the purchaser is to reduce the total price as much as possible, i.e. to bargain a maximal discount, while the seller might want to keep the discount below a certain threshold. Both parties may be willing to accept additional items added to the order for free. This defines optimal and acceptable outcomes for both sides. However, none of the two parties may play completely with open cards, i.e. the seller may try to hide the maximal discount he could offer, while the purchaser may hide the limit price he is willing to accept. Both parties may also try to hide their preferences, e.g.
whether an add-on to the order or a discount is really the preferred option. It may even be the case that adding a presumably expensive item to the order is acceptable to the seller, while the latitude for a discount is much smaller, e.g. if the add-on item does not sell very well. So, both parties apply their own strategies to achieve the best outcome for themselves.

The bargaining process then consists of making offers and counteroffers. Both offers and counteroffers narrow down the possible outcomes. For instance, an offer by the seller indicating a particular discount already determines a maximal price. The purchaser may not be happy with the offer, i.e. the price is not in the set of his/her acceptable outcomes, and therefore requests a larger discount. Bargaining first moves into the set of mutually acceptable outcomes and finally achieves an agreement, i.e. a contract. Bargaining outside the latitude of either party may jeopardise the whole contract or require that a human agent takes over the bargaining task. Similar price bargaining arises in situations when real estate, e.g. a house, is sold.

In loan applications, i.e. both personal loans and mortgages [10], the bargaining parties aim at acceptable conditions regarding disagio, interest rate, securities, duration, bail, etc. The principles are the same as for price bargaining, but the customer may bring in evidence of offers from competing financial institutions. As a loan contract binds the parties for a longer time than a one-off sale, it also becomes important that the bargaining parties trust each other. The bank must be convinced that the customer will be able to repay the loan, and the customer must be convinced that the offer made is reasonable and not an attempt to achieve extortionate conditions. In this case the set of acceptable outcomes is also constrained by law.

Figure 1 illustrates the characteristics of bargaining using a mindmap.

Figure 1: Mindmap for Bargaining Characteristics

2.2 Formal Ingredients

In order to obtain a conceptual model from these examples let us try to extract the formal ingredients of the bargaining process. From now on we concentrate on the case that only two parties are involved in the bargaining.

First of all there is the object of the bargaining, which can be expressed by a parameterised view. In the case of the sales situation this object is the order, which can be formalised by a set of items, each having a quantity, a price, and a discount, plus a discount for the total order. At the beginning of the bargaining process the set contains just the items selected by the customer, and all discounts are set to 0. During the bargaining process items may be added to the order, and discounts may be set. Similarly, in the loan bargaining situation the object is the loan, which is parameterised by interest rate, disagio, and duration, and the set of securities, some of which might belong to bailsmen, in which case the certification of the bailsmen becomes part of the bargaining object.

The set of acceptable outcomes is obtained by instantiations of the bargaining object. These instantiations are expressed by static constraints for each party. However, the constraints are not visible to the other party. They can only be inferred partially during the bargaining process. In addition to the constraints of each party there are general constraints originating from law or other agreed policies. These general constraints are visible to both parties, and they must not be violated.
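To make the notion of a parameterised bargaining object concrete, the following Python sketch models the sales order described above. It is only an illustration of the idea, not part of the formal model developed below; all names (Item, Order, the discount fields, the price limit) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Item:
    item_id: str
    price: float            # unit price indicated by the seller
    quantity: int
    discount: float = 0.0   # per-item discount, 0 at the start of bargaining

@dataclass
class Order:
    """The bargaining object: a set of items plus a total-order discount parameter."""
    items: list[Item] = field(default_factory=list)
    total_discount: float = 0.0   # also 0 before bargaining starts

    def total_price(self) -> float:
        gross = sum(i.price * i.quantity * (1 - i.discount) for i in self.items)
        return gross * (1 - self.total_discount)

# A party's acceptable outcomes are the instantiations of the object that
# satisfy its (private) static constraints, e.g. a purchaser's price limit:
def purchaser_accepts(order: Order, limit: float = 900.0) -> bool:
    return order.total_price() <= limit
```

An instantiation of Order with concrete items and discount values corresponds to one possible outcome; the predicate purchaser_accepts stands for one of the private static constraints discussed above.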
In the case of the sales situation a constraint on the side of the purchaser might be a maximal acceptable price for the original order, or it might be expressed by a minimum discount in terms of any extended order. It may also be the case that the discount is expressed by a function on the set of added items, e.g. the more items are added to the order, the higher the acceptable discount must be. In the case of the loan situation, constraints on the side of the customer can be a maximal load imposed by repayments or a maximal value of securities offered. For the bank a minimum level of security and a minimum real interest rate might define their acceptable outcomes.

Within the set of acceptable outcomes of either party the outcomes are (partially) ordered according to preferences. For any artificial party these preferences have to be part of the system specification. For instance, in the sales situation the lower the total price, the better the outcome for the purchaser (and conversely for the seller), and an offer with more additional items is ranked higher. However, whether an offer with additional items and a lower discount is preferred over a large discount depends on the individual customer and his/her goals.

An agreement is an outcome that is acceptable to both parties. Usually, bargaining terminates with an agreement, or alternatively with failure. The primary goal of each party is to achieve an agreement that is as close as possible to a maximum in the corresponding set of acceptable results. However, bargaining may also involve secondary goals such as binding a customer (for the seller or the bank). These secondary goals influence the bargaining strategy in a way that the opposite party considers the offers made to be fair and the agreement not only acceptable, but also satisfactory. This implies that constraints are classified in a way that some stronger constraints define satisfactory outcomes. This can be extended to more than just two levels of outcomes. In general, the bargaining strategy of each party is representable as a set of rules that determine the continuation of the bargaining process in terms of the offers made by the other party.

The bargaining process runs as a sequence of offers and counteroffers started by one party. Thus, in principle bargaining works in the same way as a two-player game such as chess or Go. Each offer indicates an outcome the offering party is willing to accept. Thus it can be used to reduce the set of acceptable outcomes of the other party. For instance, if the seller offers a discount, then all outcomes with a smaller discount can be neglected. Similarly, if the purchaser offers a price he is willing to pay, the seller can neglect all lower prices. Nevertheless, there is a significant difference to normal two-player games, as in bargaining there is no direct analogue of the concept of a winner. If there is no agreement, both parties lose, and both may consider themselves winners if there is an agreement. We may say that a party considers itself the winner if the agreement is perceived as being better for its own side. Such a characterisation may help to formalise "winning strategies".

Furthermore, each party may indicate acceptable outcomes to the opposite party without offering them. Such playing with open cards indicates trust in the other party, and is usually used as a means for achieving secondary (non-functional) goals. In the following we will not consider this possibility, i.e. we concentrate on bargaining with maximal hiding.
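The narrowing effect of offers and counteroffers can be pictured in code. The helper below is again only an illustrative sketch, reducing an outcome to its total price for brevity; all candidate prices and thresholds are invented for the example. Each party's knowledge is a growing list of constraint predicates, every offer appends one, and an outcome remains possible only while all accumulated constraints hold together.

```python
from typing import Callable

Constraint = Callable[[float], bool]   # outcomes reduced to a total price here

def acceptable(prices: list[float], constraints: list[Constraint]) -> list[float]:
    """Candidate total prices satisfying all constraints accumulated so far."""
    return [p for p in prices if all(c(p) for c in constraints)]

candidates = [800.0, 850.0, 900.0, 950.0, 1000.0]
constraints: list[Constraint] = []

# Seller offers a 5% discount on a nominal price of 1000, i.e. a price of 950:
constraints.append(lambda p: p <= 950.0)    # prices above 950 are off the table
# Purchaser counters by offering to pay 850:
constraints.append(lambda p: p >= 850.0)    # the seller can neglect lower prices

print(acceptable(candidates, constraints))  # [850.0, 900.0, 950.0]
```

Each move shrinks the candidate set from both ends; an agreement is reached when the parties settle on one of the remaining elements, and bargaining fails when the set becomes empty.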
In summary, we can characterise bargaining by the bargaining object, constraints for each participating party defining acceptable outcomes, partial orders on the respective sets of possible outcomes, and rules defining the bargaining strategy of each party. In the following we will link these ingredients of a bargaining process to the conceptual model of e-business systems that is offered by the co-design method.

Note that bargaining is significantly different from auction systems. The latter, e.g. the eBay system (see http://www.ebay.com), offer products for which interested parties can put in a bid. If there is at least one acceptable bid, usually the highest bid wins. Of course, each bidder follows a particular strategy and it would be challenging to formalise them, but usually systems only play the role of the auctioneer, while the bidders are users of the system.

2.3 Context of Bargaining

In addition to the outlined characteristics of the bargaining process, the attitude towards bargaining depends on many contextual issues. In some cultures bargaining is an intrinsic part of business and is applied with virtually no limits, whereas in other cultures bargaining follows pre-determined rules. Incorporating bargaining into an e-business system has to reflect this spectrum of possible attitudes. That is, all parties involved in a bargaining process act according to a particular personal profile that captures the general attitude towards bargaining, desires and expectations regarding the outcome of the bargaining process, and preferences regarding the outcome and the behaviour of the other parties. For instance, if bargaining is offered in an Arabic country, the expected latitude with respect to what can be bargained about, how much the result can deviate from the starting point, etc. must be set rather high. On the other hand, in a European context, bargaining will most likely be limited to rather small margins regarding price discounts, package offers, and preferential customer treatment.

Consequently, we also need an extension of the model of user profiles in [11] to capture the attitude towards bargaining. Correspondingly, the bargaining strategy pursued by the system has to be aware of the user profile. This implies that users have to be informed about the bargaining latitude in case this is rather limited.

3 The Co-Design Approach to Web Information Systems

If bargaining is to become an integral part of e-business systems, we first need a conceptual model for these systems. We follow the co-design approach [11], but we will only emphasise a compact model that can be used to formalise bargaining. We omit everything that deals with quality criteria, expressiveness and complexity, personalisation, adaptivity, presentation, implementation, etc., i.e. we only look at a rough skeleton of the method. In doing so, we concentrate on the story space, the plot, the views, and the operations on the views.

3.1 Story Spaces

On a high level of abstraction we may define each web information system (WIS) - thus also each e-business system - as a set of abstract locations called scenes between which users navigate. Thus, navigation amounts to transitions between scenes. Each such transition is either a simple navigation or results from the execution of an action. In this way we obtain a labelled directed graph with vertices corresponding to scenes and edges to scene transitions.
The edges are labelled by action names or by skip, the latter indicating that there is no action, but only a simple navigation. This directed graph is called the story space.

Definition 3.1. A story space E consists of a finite set S of scenes, an (optional) start scene s₀ ∈ S, an (optional) set of final scenes F ⊆ S, a finite set A of actions, a scene assignment σ : A → S, i.e. each action belongs to exactly one scene, and a scene transition relation τ ⊆ S × S × (A ∪ {skip}), i.e. whenever there is a transition from scene s₁ ∈ S to scene s₂ ∈ S, this transition is associated with an action a ∈ A with σ(a) = s₁, or with a = skip, in which case it is a navigation without action, and we have (s₁, s₂, a) ∈ τ. We write E = (S, s₀, F, A, σ, τ).

Example 3.1. Take a simple example as illustrated in Figure 2, where the WIS is used for ordering products. In this case we may define four scenes. The scene s₀ = product contains product descriptions and allows the user to select products. The scene s₁ = payment will be used to inform the user about payment method options and allow the user to select the appropriate payment method. The scene s₂ = address will be used to request information about the shipping address from the user. Finally, scene s₃ = confirmation will be used to get the user to confirm the order and the payment and shipping details.

Figure 2: Story space

There are six actions (their names are sufficient to indicate what they are supposed to do): a₁ = select_product is defined on s₀ and leads to no transition. a₂ = payment_by_card is defined on s₁ and leads to a transition to scene s₂. a₃ = payment_by_bank_transfer is defined on s₁ and leads to a transition to scene s₂. a₄ = payment_by_cheque is defined on s₁ and leads to a transition to scene s₂. a₅ = enter_address is defined on s₂ and leads to a transition to scene s₃. a₆ = confirm_order is defined on s₃ and leads to a transition to scene s₀.

In addition to the story space we need a model of the actors, i.e. user types and roles, and the tasks [11], but for our purposes here we omit this part of the method.
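Definition 3.1 and Example 3.1 can be transcribed almost literally into code. The following Python sketch is not part of the co-design method itself; it merely shows the story space as the labelled directed graph defined above. A skip navigation from product to payment is assumed here to connect the scenes, since select_product leads to no transition.

```python
# A story space E = (S, s0, F, A, sigma, tau) as plain Python data.
SKIP = "skip"

scenes = {"product", "payment", "address", "confirmation"}   # S
start = "product"                                            # s0
sigma = {                                                    # scene assignment
    "select_product": "product",
    "payment_by_card": "payment",
    "payment_by_bank_transfer": "payment",
    "payment_by_cheque": "payment",
    "enter_address": "address",
    "confirm_order": "confirmation",
}
tau = {                                                      # transition relation
    ("payment", "address", "payment_by_card"),
    ("payment", "address", "payment_by_bank_transfer"),
    ("payment", "address", "payment_by_cheque"),
    ("address", "confirmation", "enter_address"),
    ("confirmation", "product", "confirm_order"),
    ("product", "payment", SKIP),   # assumed simple navigation without an action
}

def successors(scene: str):
    """Scenes reachable from a given scene, with the triggering action or skip."""
    return [(s2, a) for (s1, s2, a) in tau if s1 == scene]

print(successors("payment"))   # three transitions to the address scene
```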
3.2 Plots

With each action we may associate a precondition and a postcondition, both expressed in propositional logic with propositional atoms that describe conditions on the state of the system. In doing so, we may add a more detailed level to the story space describing the flow of action. This can be done using constructors for sequencing, choice, parallelism and iteration in addition to the guards (preconditions) and postguards (postconditions). Using these constructors, we obtain an algebraic expression describing the flow of action, which we call the plot. In [11] it has been shown that the underlying algebraic structure is that of a Kleene algebra with tests [5], and the corresponding equational axioms can be exploited to reason about the story space and the plot on a propositional level, in particular for the purpose of personalisation.

Definition 3.2. A Kleene algebra (KA) K consists of a carrier set K containing at least two different elements 0 and 1, a unary operation *, and two binary operations + and · such that + and · are associative, + is commutative and idempotent with 0 as neutral element, 1 is a neutral element for ·, for all p ∈ K we have p0 = 0p = 0, · is distributive over +, p*q is the least solution x of q + px ≤ x, and qp* is the least solution of q + xp ≤ x, using the partial order x ≤ y ⇔ x + y = y for the last two properties.

A Kleene algebra with tests (KAT) K consists of a Kleene algebra (K, +, ·, *, 0, 1), a subset B ⊆ K containing 0 and 1 and closed under + and ·, and a unary complementation operation ¯ on B, such that (B, +, ·, ¯, 0, 1) forms a Boolean algebra. We write K = (K, B, +, ·, *, ¯, 0, 1).

Then a plot can be formalised by an expression of a KAT that is defined by the story space, i.e. the actions in A are elements of K, while the propositional atoms become elements of B.

Example 3.2. Continuing Example 3.1, we can define the plot by the expression

(a₁*(φ₁a₂φ₂ + a₃φ₃ + a₄φ₄)a₅(a₆φ₅ + 1) + 1)*

using the following conditions. Condition φ₁ = price_in_range expresses that the price of the selected product(s) lies within the range of acceptance of credit card payment. It is a precondition for action a₂. Condition φ₂ = payment_by_credit_card expresses that the user has selected the option to pay by credit card. Analogously, condition φ₃ = payment_by_bank_transfer expresses that the user has selected the option to pay by bank transfer, and condition φ₄ = payment_by_cheque expresses that the user has selected the option to pay by cheque. Condition φ₅ = order_confirmed expresses that the user has confirmed the order.
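For readers who prefer code to algebra, the plot expression of Example 3.2 can be represented as a small abstract syntax tree. This is only an illustrative sketch with invented class names: Seq stands for ·, Choice for +, Star for *, and Test models the conditions φᵢ of the KAT.

```python
from dataclasses import dataclass

# Minimal AST for KAT expressions: actions, tests, sequence, choice, iteration.
@dataclass
class Action: name: str
@dataclass
class Test: name: str
@dataclass
class Seq: parts: tuple
@dataclass
class Choice: alternatives: tuple
@dataclass
class Star: body: object

ONE = Test("1")   # the unit 1, i.e. "do nothing"

a = {i: Action(n) for i, n in enumerate(
    ["select_product", "payment_by_card", "payment_by_bank_transfer",
     "payment_by_cheque", "enter_address", "confirm_order"], start=1)}
phi = {i: Test(n) for i, n in enumerate(
    ["price_in_range", "payment_by_credit_card", "payment_by_bank_transfer",
     "payment_by_cheque", "order_confirmed"], start=1)}

# (a1*(phi1 a2 phi2 + a3 phi3 + a4 phi4) a5 (a6 phi5 + 1) + 1)*
plot = Star(Choice((
    Seq((Star(a[1]),
         Choice((Seq((phi[1], a[2], phi[2])),
                 Seq((a[3], phi[3])),
                 Seq((a[4], phi[4])))),
         a[5],
         Choice((Seq((a[6], phi[5])), ONE)))),
    ONE,
)))
```

Such a representation makes it possible, at least in principle, to apply the KAT equational axioms mentioned above as rewriting rules on the tree, e.g. for personalisation.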
3.3 Media Types

On a lower level of abstraction we add data support to each scene in the form of a media type, which basically is an extended view on some underlying database schema.

Definition 3.3. A media type M consists of a content data type cont(M) that may contain pairs ℓ : M′ with a label ℓ and the name M′ of another media type, a defining query q_M defining a view on some database schema, a set of operations, a set of hierarchical versions, a cohesion preorder, style options and some other extensions.

The database schema, the view formation and the extensions (except operations) are beyond our concern here, so it is sufficient to say that there is a data type associated with each scene such that in each instance of the story space the corresponding value of this type represents the data presented to the user - this is called a media object in [11]. In terms of the data support the conditions used in the plot are no longer propositional atoms. They can be refined by conditions that can be evaluated on the media objects. Analogously, the actions of the story space are refined by operations on the underlying database, which by means of the views also change the media objects. For our purposes it is not so important to see how these operations can be specified. It is sufficient to know their parameters.

Example 3.3. Continuing Example 3.1, let for simplicity the content data type of the media type supporting scene s₀ be defined as {(product_id, product_name, description, price)}, i.e. we would present a set of products to the user, each of which is defined by an id, a name, a description and a price. Then operation a₁ may take parameters product_id and quantity. The condition φ₁ from Example 3.2 is to express that the price of the selected products lies within the limit acceptable for credit card payment. If this price limit is a constant L, we obtain the formula

src[0, prod ∘ (π_price × π_quantity), +](product ⋈ select_product) ≤ L.

Here we exploit the fact that according to the given plot the action select_product will be executed several times, so we can build a relation with the same name collecting the parameters of all executions. Then we can join this relation with the product relation, giving us all selected products including their quantities. The structural recursion operation selects the price and quantity of each selected product, multiplies them, and adds them all up, which of course yields the total price.

Combining story space, plot and media types, we simply associate with each scene in the story space a data type, replace actions in the story space and the plot by parameterised operations, and replace conditions in the plot by complex formulae as indicated in Example 3.3. The resulting model will be called the parameterised story space, and it will serve us as the basis for formalising bargaining.

3.4 Bargaining Actions

In our sales example bargaining could come in at any time, but for simplicity let us assume that bargaining is considered to be part of the confirmation process. That is, instead of (or in addition to) the action confirm_order we may now have an action a₇ = bargain_order as indicated in Figure 3. As before, the action may have a precondition, e.g. that the total price before bargaining is above a certain threshold, or that the user belongs to a distinguished group of customers. If the bargaining action can be chosen, it will still result in a confirmed status of the order, i.e. the bargaining object, in the database. However, the way this outcome is achieved is completely different from the way other actions are executed. We will look into this execution model in the next section.

Figure 3: Story space with bargaining action

Similarly, in our loan example we find actions select_conditions_and_terms and confirm_loan. Again, if bargaining is possible, the selection of terms and conditions may become subject to a bargaining process, which will lead to an instantiated loan contract in the database - the same as without bargaining. As before, the outcome of the bargaining is different from the one without bargaining, and it is obtained in a completely different way. Therefore, in terms of the story space and the plot there is not much to change. Only some of the actions become bargaining actions. The major change is then the way these bargaining actions are refined by operations on the conceptual level of media types.
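In terms of the story space sketch given after Example 3.1, introducing the bargaining action of Figure 3 is a purely local change, which is exactly the point made above. The snippet below continues that hypothetical sketch; the threshold precondition is an invented example.

```python
# Extend the earlier story space sketch with the bargaining action a7.
sigma["bargain_order"] = "confirmation"             # a7 belongs to the confirmation scene
tau.add(("confirmation", "product", "bargain_order"))

# A possible precondition for offering the bargaining action at all, e.g.
# only orders above some threshold are worth bargaining about:
def bargaining_enabled(total_price: float, threshold: float = 500.0) -> bool:
    return total_price >= threshold
```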
4 Bargaining as a Game

Let us now look at the specification of bargaining actions in view of the characteristics derived in Section 2. We already remarked that we can consider the bargaining process as a two-player game. Therefore, we want to model bargaining actions as games. There are two questions related to this kind of modelling:

1. What is the ground the game is played on? That is, we ask how the game is played, which moves are possible, and how they are represented. This of course has to take care of the history that led to the bargaining situation, the bargaining object, and the constraints on it.

2. How will the players act? This question can only be answered for the system player, while a human player, i.e. a customer, is free in his/her decisions within the constraints offered by the system. Nevertheless, we should assume that both sides - if they act reasonably - base their choices on similar grounds. The way players choose their moves will be determined by the order on the set of acceptable outcomes and the bargaining strategy.

4.1 Bargaining Games

An easy answer to the first question could be to play on the parameterised bargaining object, i.e. to consider instances of the corresponding data type. However, this would limit the possible moves in a way that no reconsideration of the previous actions that led to the bargaining situation would be possible. Therefore, it is better to play on the parameterised story space that we introduced in the previous section. Each player maintains a set of static constraints on the parameterised story space. These constraints subsume

- general constraints on the bargaining as defined by law and policies;
- constraints determining the acceptable outcomes of the player;
- constraints arising from offers made by the player him/herself - these offers reduce the set of acceptable outcomes;
- constraints arising from offers made by the opponent player - these offers may also reduce the set of acceptable outcomes.

These constraints give rise to definitions of what a bargaining game is, what a state of such a game is, and which moves are possible in this game. We will now introduce these definitions step by step.

Definition 4.1. A bargaining game G consists of a parameterised story space E, a parameterised plot P, and three sets S₀, S₁ and S₂ of static constraints on the parameters in E and P. We write G = (E, P, S₀, S₁, S₂).

Recall that E results from the story space as defined in Definition 3.1 by assigning the content data type of a media type to each scene, and by replacing the actions by the corresponding parameterised operations. Similarly, P results from a KAT expression as defined in Definition 3.2 by replacing atomic actions by the corresponding parameterised operations and propositional atoms by the corresponding formulae on the underlying database schema. S₀ formalises legal constraints, while Sᵢ formalises the acceptable outcomes of player i (i = 1, 2).

Example 4.1. Let us look again at our sales example from Example 3.1. Assume that player one is the purchaser. Then a constraint in S₁ may be that the total price does not exceed a particular limit, which can be formalised by a formula of the form

src[0, prod ∘ (π_price × π_quantity), +](product ⋈ select_product) × (1 − d) ≤ M.

Here d indicates a discount, and M might be a constant. Alternatively, the purchaser may expect a minimum discount depending on the total nominal price.

With these constraints each player obtains a set of possible instantiations that are at least acceptable to him/her. The moves of the players just add constraints. This leads to the definitions of states and moves.

Definition 4.2. A state of a bargaining game G = (E, P, S₀, S₁, S₂) consists of a partial instance p of P with the last action leading to the bargaining scene, and two sets S₁′ and S₂′ of static constraints on the parameters in E and P, such that S₀ ∪ Sᵢ ∪ Sᵢ′ is satisfiable (i = 1, 2). We write s = (p, S₁′, S₂′).

Obviously, the initial state of the game is determined by the navigation of the user through the story space before reaching the bargaining scene. This defines p, while S₁′ and S₂′ are empty.

Example 4.2. In our sales example we may have a partial instance of a plot defined by p = select_product(i₄, 5) select_product(i₇, 2) payment_by_card(...) enter_address(...), which means that the user selected the products with ids i₄ and i₇ with quantities 5 and 2, respectively, then chose payment by credit card - the omitted parameters would contain the credit card number, brand, name on the card and expiry date - and finally entered a shipping address - again with parameters omitted. This defines the initial state of the bargaining game. At a later stage the purchaser may have indicated that he would accept a total price m. This would give rise to the constraint

src[0, prod ∘ (π_price × π_quantity), +](product ⋈ {(i₄, 5), (i₇, 2)}) × (1 − d) ≥ m

in S₁′.
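The states and moves just defined, together with the notion of agreement introduced below, can be paraphrased in code by treating constraints as predicates over candidate instantiations and moves as constraint additions, with satisfiability checked by exhaustion over a finite candidate set. The sketch below is a toy illustration under these simplifying assumptions, not the formal model itself.

```python
from typing import Callable

Outcome = dict            # a candidate instantiation of the parameters
Constraint = Callable[[Outcome], bool]

class BargainingState:
    def __init__(self, candidates: list[Outcome],
                 s0: list[Constraint], s1: list[Constraint], s2: list[Constraint]):
        self.candidates = candidates              # finite set of instantiations
        self.s0, self.s1, self.s2 = s0, s1, s2    # legal / player 1 / player 2
        self.offers1: list[Constraint] = []       # S1' - constraints from moves
        self.offers2: list[Constraint] = []       # S2'

    def satisfiable(self, player: int) -> bool:
        """Is S0 ∪ S_i ∪ S1' ∪ S2' satisfiable for some candidate?"""
        own = self.s1 if player == 1 else self.s2
        cs = self.s0 + own + self.offers1 + self.offers2
        return any(all(c(o) for c in cs) for o in self.candidates)

    def move(self, player: int, constraint: Constraint) -> bool:
        """A move adds a constraint; it is only legal if satisfiability remains."""
        offers = self.offers1 if player == 1 else self.offers2
        offers.append(constraint)
        if not self.satisfiable(player):
            offers.pop()                           # illegal move: retract it
            return False
        return True

    def agreements(self) -> list[Outcome]:
        """Outcomes acceptable under all constraints of both players."""
        cs = self.s0 + self.s1 + self.s2 + self.offers1 + self.offers2
        return [o for o in self.candidates if all(c(o) for c in cs)]
```

For instance, with candidates ranging over discount values, a seller move might append the constraint lambda o: o["discount"] <= 0.05, mirroring the offer in Example 4.2.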
Definition 4.3. A run of a bargaining game G = (E, P, S₀, S₁, S₂) is a sequence s₀ → s₁ → ⋯ → s_k of states sᵢ = (pᵢ, S₁′(i), S₂′(i)) satisfying the following properties:

- s₀ is the initial state of the game.
- pᵢ₊₁ either equals some pⱼ with j ≤ i or extends pᵢ.
- If i + 1 is odd, then S₀ ∪ S₁ ∪ S₁′(i) ∪ S₂′(i) must be satisfiable, and S₁′(i+1) extends S₁′(i).
- If i + 1 is even, then S₀ ∪ S₂ ∪ S₂′(i) ∪ S₁′(i) must be satisfiable, and S₂′(i+1) extends S₂′(i).

Each transition from sᵢ to sᵢ₊₁ in a run is called a move by player one or two, depending on whether i is even or odd, respectively.

So a move by a player is made by presenting an offer. For the player him/herself this offer means indicating that certain outcomes might be acceptable, while better outcomes are no longer aimed at. This includes that a move may manipulate the bargaining object by extending the partial instance of the plot. However, a player may also reject such a change as proposed by the opponent player. In addition, constraints arising from moves will be added to the constraint sets Sᵢ′. For instance, if a seller offers a discount and thus a total price, s/he gives away all outcomes with a higher price. For the opponent player the offer means the same, but the effect on his/her set of acceptable outcomes is different. Moves are only possible as long as the constraints arising from the counteroffers leave the latitude to retain at least one acceptable outcome. If the set of instantiations reduces to a single element, we obtain an agreement. If it reduces to the empty set, the bargaining has failed.

Definition 4.4. A run s₀ → s₁ → ⋯ → s_k is called successful iff S₀ ∪ S₁ ∪ S₂ ∪ S₁′(k) ∪ S₂′(k) is satisfiable, and S₁′(k) ∪ S₂′(k) is maximal with this property. In this case the instance p_k of the plot in state s_k is the agreement of the bargaining game.

A bargaining game ends with an agreement, or terminates unsuccessfully if a player cannot continue making a move.

In addition to "ordinary" moves we may allow moves that represent "last offers". A last offer is an offer indicating that no better one will be made. For instance, a total price offered by a seller as a last offer implies the constraint that the price can only be higher. However, it does not discard other options that may consist in additional items at a bargain price or priority treatment in the future. Thus, last offers add stronger constraints, which may even cause the set of acceptable outcomes to become empty, i.e. failure of the bargaining process. Note that this definition of a "last offer" differs from tactical play, where players indicate that the offer made is final without really meaning it. Such tactics provide an open challenge for bargaining systems.

4.2 Bargaining Strategies

By making an offer or a last offer, a player makes a move that will result in an acceptable outcome satisfying all constraints arising from counteroffers. In order to make such a choice each player uses a partial order on the set of possible outcomes. Thus, we can model this by a partial order on the set of instances of the parameterised story space. We define it by a logical formula that can be evaluated on pairs of instances.

Definition 4.5. For a bargaining game G = (E, P, S₀, S₁, S₂) the instances satisfying S₀ ∪ Sᵢ define the set of acceptable outcomes of player i (i = 1, 2), denoted Oᵢ. The preference order of player i is a formula