MATCOS-10
Proceedings of the 2010 Mini-Conference on Applied Theoretical Computer Science
Edited by Andrej Brodnik and Gabor Galambos

Reviewers and Programme Committee
Andrej Brodnik, University of Primorska and University of Ljubljana, Slovenia (co-chair)
Gabor Galambos, University of Szeged, Hungary (co-chair)
Gabriel Istrate, Universitatea Babes Bolyai, Cluj, Romania
Miklos Kresz, University of Szeged, Hungary
Gerhard Reinelt, Ruprecht-Karls-Universität, Heidelberg, Germany
Borut Robič, University of Ljubljana, Slovenia
Magnus Steinby, University of Turku, Finland
Borut Žalik, University of Maribor, Slovenia

Organizing Committee
David Paš, chair
Iztok Cukjati
Milan Djordjevic
Andrej Kramar
Tine Šukljan

Published by University of Primorska Press, Titov trg 4, 6000 Koper, Koper 2011
Editor in Chief: Jonatan Vinkler
Managing Editor: Alen Ježovnik
© 2011 University of Primorska Press

CIP — Kataložni zapis o publikaciji
Narodna in univerzitetna knjižnica, Ljubljana
004(082)(0.034.2)
MINI-Conference on Applied Theoretical Computer Science (2010 ; Koper)
MATCOS-10 [Elektronski vir] : proceedings of the 2010 Mini-Conference on Applied Theoretical Computer Science / edited by Andrej Brodnik and Gabor Galambos. — El. knjiga. — Koper : University of Primorska Press, 2011
Način dostopa (URL): http://www.hippocampus.si/ISBN/978-961-6832-10-6.pdf

Preface
The collaboration between the University of Primorska and the University of Szeged goes back only a short period. Even so, there are a couple of research interests where the two groups can collaborate. The MATCOS-10 (Mini-Conference on Applied Theoretical Computer Science) conference started from the idea of collecting the results obtained in our joint projects. We wanted to organize a mini-conference with papers from several streams of operations research and beyond. We invited papers dealing with methods from theoretical computer science that have been integrated into real-world applications. Practical solutions for NP-hard problems, algorithmically oriented AI and data mining solutions, new models and methods in systems biology and bioinformatics, automata theory solutions in software and hardware verification, and prospective new models of computation and future computer architectures were the preferred topics in the selection. We also had a second aim with this conference: we wanted to give young researchers a chance to present their results. Therefore we organized a special Student Session. We encouraged regular and PhD students to take part in this conference in order to get a first impression of the flavour of a conference. The MATCOS-10 conference was held on October 13-14, 2010 at the University of Primorska in Koper, in conjunction with the 13th Multi-Conference on Information Society, October 11-15, 2010, Ljubljana, Slovenia. In response to the call for papers we received 38 submissions. Each submission was reviewed by at least two referees. Based on the reviews, the Program Committee selected 23 papers for presentation in these proceedings. In addition to the selected papers, Andras Recski from the TU Budapest gave an invited talk on 'Applications of Combinatorics in Statics.' We used the EasyChair system to manage the submissions. Thanks to David Paš, chair of the Organizing Committee, for maintaining the on-line background of the conference.
We are also grateful to Miklos Kresz, who, besides his great activity in the Organizing Committee, managed the homepage of the conference. Elated by the success of MATCOS-10, we decided to organize the conference regularly. See you at the next MATCOS conference in Szeged!
Andrej Brodnik and Gabor Galambos

Invited Paper
Rigidity of Frameworks: Applications of Combinatorial Optimization to Statics, Andras Recski ■ 1

Student Papers
Succinct Data Structures for Dynamic Strings, Tine Sukljan ■ 5
Propagation Algorithms for Protein Classification, Andras Bota ■ 9
Optimal and Reliable Covering of Planar Objects with Circles, Endre Palatinus ■ 15
Truth-Teller-Liar Puzzles - A Genetic Algorithm Approach (with Tuning and Statistics), Attila Tanyi ■ 19
Heuristics for the Multiple Depot Vehicle Scheduling Problem, Balazs David ■ 23
Bayesian Games on a Maxmin Network Router, Grigorios G. Anagnostopoulos ■ 29
An Extensible Probe Architecture for Network Protocol Performance Measurement, David Božič ■ 35
Detection and Visualization of Visible Surfaces, Danijel Žlaus ■ 41
Efficient Approach for Visualization of Large Cosmological Particle Datasets, Niko Lukač ■ 45

Regular Papers
Computing the Longest Common Subsequence of Two Strings When One of Them is Run-Length Encoded, Shegufta Bakht Ahsan, Tanaeem M. Moosa, and M. Sohel Rahman ■ 49
Prefix Transpositions on Binary and Ternary Strings, Amit Kumar Dutta, Masud Hasan, and M. Sohel Rahman ■ 53
Embedding of Complete and Nearly Complete Binary Trees into Hypercubes, Aleksander Vesel ■ 57
Information Set-Distance, Joel Ratsaby ■ 61
Better Bounds for the Bin Packing Problem with the 'Largest Item in the Bottom' Constraint, György Dosa, Zsolt Tuza, and Deshi Ye ■ 65
Semi-on-Line Bin Packing: An Overview and an Improved Lower Bound, Janos Balogh and Jozsef Bekesi ■ 69
Determining the Expected Runtime of an Exact Graph Coloring Algorithm, Zoltan Adam Mann and Aniko Szajko ■ 75
Speeding up Exact Cover Algorithms by Preprocessing and Parallel Computation, Sandor Szabo ■ 85
Community Detection and Its Use in Real Graphs, Andras Bota, Laszlo Csizmadia, and Andras Pluhar ■ 95
Greedy Heuristics for Driver Scheduling and Rostering, Viktor Argilan, Csaba Kemeny, Gabor Pongracz, Attila Toth, and Balazs David ■ 101
A Note on Context-Free Grammars with Rewriting Restrictions, Zsolt Gazdag ■ 109
Using Multigraphs in the Shallow Transfer Machine Translation, Jernej Vičič ■ 113
Model Checking of the Slotted CSMA/CA MAC Protocol of the IEEE 802.15.4 Standard, Zoltan L. Nemeth ■ 121
Experiment-Based Definitions for Electronic Exam Systems, Andrea Huszti ■ 127

Rigidity of Frameworks: Applications of Combinatorial Optimization to Statics
Andras Recski
Budapest University of Technology and Economics, H-1521 Budapest, Hungary
+36-1-4632585, recski@cs.bme.hu

ABSTRACT
The survey presents various applications of graphs and matroids to the rigidity of classical and tensegrity frameworks.

Categories and Subject Descriptors: F.2.2 Computations on Discrete Structures
General Terms: Algorithms
Keywords: Graphs, matroids, frameworks, rigidity, structural topology

1. INTRODUCTION
Apart from the last section, we shall consider only frameworks composed of rigid rods and rotatable joints. Such a framework may be rigid or nonrigid, see Figure 1. Rigidity depends on the dimension of the space as well: the framework of Figure 2 is rigid in the plane but not in space. A nonrigid framework need not necessarily be a "mechanism"; rigidity also excludes "infinitesimal" motions like those in Figure 3.
Hence our "rigidity" is called "infinitesimal rigidity" by most authors.

Figure 1   Figure 2   Figure 3

2. RIGIDITY OF A GIVEN FRAMEWORK
Deciding the rigidity of a framework F can be formulated as a problem of linear algebra. Rigidity of a rod connecting joints vi and vj with respective coordinates (xi, yi, ...) and (xj, yj, ...) means that (xi - xj)^2 + (yi - yj)^2 + ... = constant. Its derivative is (xi - xj)dxi/dt + (xj - xi)dxj/dt + (yi - yj)dyi/dt + (yj - yi)dyj/dt + ... = 0. All these equations together give AF u = 0, where the number of rows of the matrix AF is the number e of the rods, the number of columns of AF is the number v of the joints multiplied by the dimension d of the space, and u = (dx1/dt, dx2/dt, ..., dy1/dt, dy2/dt, ...)^T. The congruent motions of the d-dimensional space (which form a subspace of dimension d(d+1)/2) are all solutions of this system. A framework F will be called rigid in the d-dimensional space if there are no other solutions, that is, if the rank of AF is vd - d(d+1)/2.

3. DIFFERENT FRAMEWORKS WITH ISOMORPHIC GRAPHS
A framework F can be described by a graph G(F) with v vertices and e edges. However, one of the planar frameworks of Figure 4 and one of the 3-dimensional frameworks of Figure 5 are rigid, the other two are not, showing that G(F) alone determines only the zero-nonzero pattern of the matrix AF, but not its rank, hence not the rigidity of F either. Clearly, if two determinants have the same zero-nonzero pattern, one of them may vanish if certain entries cancel each other out.

Figure 4   Figure 5

4. WHAT CAN COMBINATORIALISTS DO HERE?
While the examples of Section 3 show that one cannot expect to decide rigidity from the graph alone, combinatorial methods can still help in two cases: either if the framework is "very symmetric", see Sections 7-8 below, or if it has "no regularity at all". The latter means that no cancellation may arise among the entries of AF unless it is reflected by the structure of G(F). Formally, we call a graph G (not the framework!) generic rigid in dimension d if there exists a rigid d-dimensional framework F with G(F) = G. Alternatively, G is generic rigid in dimension d if and only if, realizing it in the d-dimensional space so that the coordinates of the joints are algebraically independent over the field of the rational numbers, the resulting framework is rigid. For example, the graphs of the frameworks of Figures 4 and 5 are generic rigid (in 2 and 3 dimensions, respectively) while the graph of the second framework of Figure 1 is not.

5. GENERIC RIGID GRAPHS IN THE PLANE
Deciding generic rigidity of any graph in any dimension is in NP (if there are rigid realizations, there are some with "small" integer coordinates as well). The problem is in P for dimension 2 and its status is unknown for dimension ≥ 3. For simplicity we restrict ourselves to minimum generic rigid graphs in dimension 2. Then, with the notation of Section 2, e = 2v - 3 and clearly e' ≤ 2v' - 3 for any "subsystem" with e' rods among v' joints, to avoid situations like that of Figure 6 where a part of the system is "overbraced". This stronger necessary condition, known to Maxwell [14] already in 1864, was proved to be sufficient (Laman [12]). This leads to a polynomial algorithm by observing (Lovasz and Yemini, [13]) that the necessary and sufficient condition is equivalent to the property that, doubling any edge of the graph, the resulting graph with 2v - 2 edges is the union of two trees.
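To make the rank test of Section 2 concrete, here is a minimal numerical sketch, assuming NumPy is available; the triangle and four-cycle frameworks below are hypothetical examples, not the ones in the figures. With random (effectively generic) joint coordinates the same test also gives a practical, non-certified check of generic rigidity in the sense of Section 4.

```python
import numpy as np

def rigidity_matrix(coords, rods):
    """Build A_F: one row per rod, d columns per joint. The row for rod (i, j)
    holds (p_i - p_j) in joint i's columns and (p_j - p_i) in joint j's
    columns: the derivative of the squared-length constraint from Section 2."""
    v, d = coords.shape
    A = np.zeros((len(rods), v * d))
    for row, (i, j) in enumerate(rods):
        diff = coords[i] - coords[j]
        A[row, i * d:(i + 1) * d] = diff
        A[row, j * d:(j + 1) * d] = -diff
    return A

def is_rigid(coords, rods):
    """F is (infinitesimally) rigid iff rank(A_F) = v*d - d*(d+1)/2."""
    v, d = coords.shape
    target = v * d - d * (d + 1) // 2
    return np.linalg.matrix_rank(rigidity_matrix(coords, rods)) == target

# A triangle in the plane is rigid; a four-cycle is not.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.3, 0.9]])
print(is_rigid(triangle, [(0, 1), (1, 2), (2, 0)]))        # True
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(is_rigid(square, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # False
```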
This condition can be checked in polynomial time, using the matroid partition algorithm of Edmonds [9]. The analogue of Maxwell's condition in dimension 3 can also be reformulated with tree decompositions (Recski [15]) but this is only necessary. Figure 7 shows the smallest counterexample.

Figure 6   Figure 7

6. RECONSTRUCTION OF POLYHEDRA
Apart from its obvious applicability in structural engineering, rigidity has an interesting relation to pattern reconstruction. Suppose we wish to decide from a 2-dimensional drawing whether it is the projection of a polyhedron. The reason for a negative answer may already follow from the graph of the drawing (drawings like those in Figure 8 are well known), but sometimes the answer is positive only if some additional conditions are met (like the three dotted lines meeting at the same point in Figure 9). Under some conditions a drawing arises as the projection of a 3-dimensional polyhedron if and only if it is nonrigid as a 2-dimensional framework. For example, Figure 10 shows that the second framework of Figure 4 is rigid while the first one is not.

Figure 10

7. RIGIDITY OF SQUARE GRIDS
Another area of combinatorial applications arises if we restrict our attention to "very regular" frameworks. For example, the k × l square grid as a planar framework requires at least k + l - 1 diagonal rods to become rigid, and the minimum systems of such diagonals correspond to the spanning trees of the complete bipartite graph Kk,l with k + l vertices (Bolker and Crapo, [5]). For example, Figure 11 shows a 3 × 3 grid with a possible deformation; the corresponding graph is disconnected.

Figure 8   Figure 9   Figure 11

8. RIGIDITY OF A 1-STORY BUILDING
Consider a 1-story building, with the vertical bars fixed to the earth via joints. If each of the four external vertical walls contains a diagonal then the four corners of the roof become fixed. Hence questions related to the rigidity of a one-story building are reduced (Bolker and Crapo, [5]) to those related to the rigidity of a 2-dimensional square grid of size k × l where the corners are pinned down. Then the minimum number of diagonal rods necessary for infinitesimal rigidity was proved to be k + l - 2 (Bolker and Crapo, [5]) and the minimum systems correspond to asymmetric 2-component forests (Crapo, [7]). A 2-component forest is asymmetric if the ratio of the cardinalities of the bipartition sets in the two components is different from k:l. For example, parts (a) and (b) of Figure 12 show two systems with the corresponding 2-component forests. The second system is nonrigid; an infinitesimal deformation is shown in part (c) of the figure. Observe that the points u, v, w are collinear in Figure 12(b); this can always be achieved after appropriate permutations of the row set and of the column set, if the 2-component forest is symmetric.

Figure 12 (a) (b)

9. TENSEGRITY FRAMEWORKS
Since real-life rods are less reliable against compression than against tension, we might prefer using more diagonal rods but under tension only. In order to study such questions, the concept of tensegrity frameworks has been introduced: the elements connecting the joints may be rods (which are rigid both under tension and compression), cables (used under tension only) and possibly struts as well (used under compression only).
Then the diagonal tensegrity elements will correspond to directed edges, and a collection of such elements will rigidify the square grid if and only if the corresponding subgraph is strongly connected (Baglivo and Graver, [2]). The corresponding questions for 1-story buildings are more complicated and need various tools of matching theory and network flow techniques (Chakravarty et al., [6], Recski and Schwärzler, [20]). More recently, some of the results described in Section 5 have also been generalized for tensegrity frameworks (Jordan et al., [11], Recski, [21], Recski and Shai, [22]).

10. ACKNOWLEDGEMENTS
The support of the Hungarian National Science Foundation and the National Office for Research and Technology (Grant number OTKA 67651) is gratefully acknowledged.

11. REFERENCES
[1] L. Asimow and B. Roth, 1978-79. The rigidity of graphs, Trans. Amer. Math. Soc. 245, 279-289 (Part I) and J. Math. Anal. Appl. 68, 171-190 (Part II).
[2] J. A. Baglivo and J. E. Graver, 1983. Incidence and symmetry in design and architecture, Cambridge University Press, Cambridge.
[3] E. Bolker, 1977. Bracing grids of cubes, Env. Plan. B, 4, 157-172.
[4] E. Bolker, 1979. Bracing rectangular frameworks II, SIAM J. Appl. Math. 36, 491-508.
[5] E. Bolker and H. Crapo, 1979. Bracing rectangular frameworks, SIAM J. Appl. Math. 36, 473-490.
[6] N. Chakravarty, G. Holman, S. McGuinness and A. Recski, 1986. One-story buildings as tensegrity frameworks, Struct. Topology, 12, 11-18.
[7] H. Crapo, 1977. More on the bracing of one-story buildings, Env. Plan. B, 4, 153-156.
[8] H. Crapo, 1979. Structural rigidity, Struct. Topology 1, 26-45.
[9] J. Edmonds, 1968. Matroid partition, Math. of the Decision Sciences, Part I, Lectures in Appl. Math., 11, 335-345.
[10] G. Hetyei, 1964. On covering by 2×1 rectangles, Pecsi Tanarkepzo Foisk. Közl., 151-168.
[11] T. Jordan, A. Recski and Z. Szabadka, 2009. Rigid tensegrity labelling of graphs, European J. of Combinatorics 30, 1887-1895.
[12] G. Laman, 1970. On graphs and rigidity of plane skeletal structures, J. of Eng. Math. 4, 331-340.
[13] L. Lovasz and Y. Yemini, 1982. On generic rigidity in the plane, SIAM J. Alg. Discrete Methods, 3, 91-98.
[14] J. C. Maxwell, 1864. On the calculation of the equilibrium and stiffness of frames, Philos. Mag. (4), 27, 294-299.
[15] A. Recski, 1984. A network theory approach to the rigidity of skeletal structures II. Laman's theorem and topological formulae, Discrete Applied Math. 8, 63-68.
[16] A. Recski, 1987. Elementary strong maps of graphic matroids, Graphs and Combinatorics, 3, 379-382.
[17] A. Recski, 1988/89. Bracing cubic grids - a necessary condition, Discrete Math., 73, 199-206.
[18] A. Recski, 1989. Matroid theory and its applications in electric network theory and in statics, Springer, Berlin-Heidelberg-New York; Akademiai Kiado, Budapest.
[19] A. Recski, 1991. One-story buildings as tensegrity frameworks II, Struct. Topology 17, 43-52.
[20] A. Recski and W. Schwärzler, 1992. One-story buildings as tensegrity frameworks III, Discrete Applied Math., 39, 137-146.
[21] A. Recski, 2008. Combinatorial conditions for the rigidity of tensegrity frameworks, Bolyai Society Mathematical Studies 17, 163-177.
[22] A. Recski and O. Shai, 2010. Tensegrity frameworks in the one-dimensional space, European Journal of Combinatorics 31, 1072-1079.
[23] B. Roth and W. Whiteley, 1981. Tensegrity frameworks, Trans. Amer. Math. Soc. 265, 419-446.
[24] T. Tarnai, 1980. Simultaneous static and kinematic indeterminacy of space trusses with cyclic symmetry, Int. J.
Solids and Structures, 16, 347-359.
[25] W. Whiteley, 1988. The union of matroids and the rigidity of frameworks, SIAM J. Discrete Math. 9, 237-255.

Succinct data structures for dynamic strings
Tine Sukljan
FAMNIT, Univerza na Primorskem, Koper, Slovenia
tine.sukljan@upr.si

ABSTRACT
The biggest problem of full-text indexes is their space consumption. In the last few years much effort has been put into the research of these indexes. At first the idea was to exploit the compressibility of the text, so that the index would take less space. Recently this idea evolved further into the so-called self-indexes, which store enough information to replace the original text. This idea of an index which takes space close to that of the compressed text, replaces it and provides fast search over it has been the source of much new research in the last few years. In this article we present the most recent results in this area of research.

1. INTRODUCTION
Succinct data structures provide solutions to reduce the storage cost of modern applications that process large data sets, such as web search engines, geographic information systems, and bioinformatics applications. First proposed by Jacobson [8], the aim is to encode a data structure using space close to the information-theoretic lower bound, while supporting efficient navigation operations in it. We are given a sequence S of length n over an alphabet Σ of size σ. We can use the operation access(S, i) to read the symbol at position i of the sequence S (0 ≤ i < n). Given the specified parameters we want to support two more query operations for a symbol c ∈ Σ [3]:
• select_c(S, j): return the position of the j-th occurrence of symbol c, or -1 if that occurrence does not exist;
• rank_c(S, p): count how many occurrences of c are found in S[0, p - 1].
In many situations only retrieving data (even though efficiently) is not enough, as data in these situations are updated frequently. In this case two additional operations are desired:
• insert_c(S, i): insert character c between S[i - 1] and S[i];
• delete(S, i): delete S[i] from S.

2. RANK AND SELECT OPERATIONS ON DYNAMIC COMPRESSED SEQUENCES
Gonzalez et al. in [3] presented a data structure with nH0 + o(n log σ) bits of space and O(log n (1 + log σ / log log n)) worst-case time. We first present a data structure for the Collection of Searchable Partial Sums with Indels (CSPSI) problem, which will be used later in the article. The problem consists of maintaining a collection C = S^1, ..., S^σ of sequences of nonnegative integers s^j_i, supporting the following operations:
• sum(C, j, i): returns the sum of the first i integers of the sequence S^j;
• search(C, j, y): returns the smallest i such that sum(C, j, i) ≥ y;
• update(C, j, i, x): updates s^j_i to s^j_i + x;
• insert(C, i): inserts 0 between s^j_{i-1} and s^j_i for all 1 ≤ j ≤ σ;
• delete(C, i): deletes s^j_i from the sequence S^j for all 1 ≤ j ≤ σ. To perform delete(C, i) it must hold that s^j_i = 0 for all 1 ≤ j ≤ σ.
We now present the data structure that Gonzalez et al. introduced in [3]. They construct a red-black tree over C. Each leaf contains a non-empty superblock of size from (1/2) log^2 n to 2 log^2 n bits. The leftmost leaf contains s^1_1 ... s^1_{b1} s^2_1 ... s^2_{b1} ... s^σ_1 ... s^σ_{b1}, the second s^1_{b1+1} ... s^1_{b2} s^2_{b1+1} ... s^2_{b2} ... s^σ_{b1+1} ... s^σ_{b2}, and so on. The sizes of the leaves are variable but bounded: b1, b2, ... are such that (1/2) log^2 n ≤ σkb1, σk(b2 - b1), ... ≤ 2 log^2 n, where k is the number of bits used to store one integer. Each internal node v stores two additional types of counters: p(v), the number of positions stored in the leaves of the left subtree of v, and r_j(v), the sum of the integers of sequence S^j stored in the left subtree of v. A red-black tree of the same kind, whose superblocks hold the symbols of S and whose counter r(v) gives the number of superblocks in the left subtree, supports the queries on S. To answer access(S, i) we traverse the tree from the root, starting with sb ← 1 and pos ← i; at an internal node v, if pos ≤ p(v) we enter the left subtree, otherwise we enter the right subtree with sb ← sb + r(v) and pos ← pos - p(v). We reach the leaf that contains the i-th position in O(log n) time. Then we directly access the pos-th symbol of superblock sb.
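The descent just described can be illustrated with a toy sketch (Python; the Node and Leaf classes and the list-based superblocks are simplifications assumed for the example, the real structure packs symbols into bit blocks and keeps the tree balanced):

```python
class Node:
    """Toy internal node of the balanced tree described above.
    p = number of symbols stored in the left subtree,
    r = number of superblocks stored in the left subtree."""
    def __init__(self, left, right, p, r):
        self.left, self.right, self.p, self.r = left, right, p, r

class Leaf:
    """Toy leaf: the superblock is just a Python list of symbols here."""
    def __init__(self, symbols):
        self.symbols = symbols

def access(root, i):
    """Return (superblock index, symbol) for position i (1-based),
    following the descent described above."""
    node, sb, pos = root, 1, i
    while isinstance(node, Node):
        if pos <= node.p:
            node = node.left
        else:
            sb += node.r
            pos -= node.p
            node = node.right
    return sb, node.symbols[pos - 1]

# Two superblocks holding "abc" and "de"; position 4 is 'd'.
root = Node(Leaf(list("abc")), Leaf(list("de")), p=3, r=1)
print(access(root, 4))   # (2, 'd')
```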
insert_c(S, i): We obtain sb and pos just like in the access query, except that we start with pos ← i - 1, so as to insert right after position i - 1. Then, if superblock sb contains room for one more symbol, we insert c right after the pos-th position of sb, by shifting the symbols through the blocks as explained. If the insertion causes an overflow in the last block of sb, we simply add a new block at the end of the linked list to hold the trailing bits. We also carry out update(C, c, sb, 1) and retraverse the path from the root to sb, adding 1 to p(v) each time we go left from v. In this case we finish in O(log n) time. If, instead, the superblock is full, we cannot carry out the insertion yet. We first move one symbol to the previous superblock (creating a new one if this is not possible): We first delete(T, d) the first symbol c' from block sb (the global position of c' is d = i - pos), and this cannot cause an underflow of sb. Now, we check how many symbols superblock sb - 1 has (this is easy by subtracting the pos numbers corresponding to accessing blocks sb - 1 and sb). If superblock sb - 1 can hold one more symbol, we insert the removed symbol c' at the end of superblock sb - 1. This is done by calling insert_c'(T, d), a recursive invocation that will now arrive at block sb - 1 and will not overflow it (thus no further recursion will occur). If superblock sb - 1 is also full or does not exist, then we are entitled to create a sparse superblock between sb - 1 and sb, without breaking the invariant on sparse superblocks. We create such an empty superblock and insert symbol c' into it, using the following procedure: We retraverse the path from the root to sb, updating r(v) to r(v) + 1 each time we go left from v. When we arrive again at leaf sb we create a new node β with r(β) = 1 and p(β) = 1. Its left child is the new empty superblock, where the single symbol c' is inserted, and its right child is sb. We also execute insert(C, sb) and update(C, c', sb, 1). After creating β, we must check if we need to rebalance the tree. If it is needed, it can be done with O(1) rotations and O(log n) red-black tag updates. After a rotation we need to update r(·) and p(·) only for one tree node. These updates can be done in constant time. Now that we have finally made room to carry out the original insertion, we rerun insert_c(T, i) and it will not overflow again. The whole insert operation takes O(log n) time.
To compress, we divide every superblock into segments representing ⌊(1/2) log_σ n⌋ original symbols from S. We then represent each segment using the (c, o)-pair encoding of Ferragina et al. [1]. For larger alphabets we use a ρ-ary wavelet tree [1] over T; on each level we store a sequence over an alphabet of size ρ, which is handled as described above. This way we get:
THEOREM 2. Given a text S of length n over an alphabet of size σ and zero-order entropy H0(S), the Dynamic Sequence with Indels problem under the RAM model with word size w = Θ(log n) can be solved using nH0(S) + o(n log σ) bits of space, supporting the queries access, rank, select, insert and delete in O(log n (1 + log σ / log log n)) worst-case time.

3. IMPROVED RANK AND SELECT ON DYNAMIC STRINGS
Recently He et al. in [6] managed to improve the result of Gonzalez and Navarro. They followed the same idea, but instead of using red-black trees, they used B-trees. First we need to introduce their modification of the CSPSI solution. Then we will look at their result. We construct a B-tree over the collection C.
Let L = ⌈log^2 n / log log n⌉. Each leaf of the tree stores a superblock of size between L/2 and 2L bits. Each superblock stores the same number of integers from each sequence in C. More precisely, the content of the leftmost leaf is s^1_1 ... s^1_{b1} s^2_1 ... s^2_{b1} ... s^d_1 ... s^d_{b1}, the second s^1_{b1+1} ... s^1_{b2} s^2_{b1+1} ... s^2_{b2} ... s^d_{b1+1} ... s^d_{b2}, and so on. b1, b2, ... satisfy the following conditions because of the requirement on the sizes of the superblocks: L/2 ≤ b1·k·d ≤ 2L, L/2 ≤ (b2 - b1)·k·d ≤ 2L, ... Every internal node v stores some additional counters (let h be the number of children of v): a sequence P(v)[1..h], in which P(v)[i] is the number of positions stored in the leaves of the subtree rooted at the i-th child of v for any sequence in C (the number is the same for every sequence in C); and a sequence R_j(v)[1..h] for each j = 1, 2, ..., d, in which R_j(v)[i] is the sum of the integers from sequence S^j that are stored in the leaves of the subtree rooted at the i-th child of v. We additionally divide each superblock into blocks of ⌈log^{3/2} n⌉ bits each. The described data structure supports the sum, search and update operations in O(log n / log log n) time and the operations insert and delete in O(log n / log log n) amortized time. Proofs are omitted but can be found in [6], Section 3.
We can now turn our attention to the rank/select problem. We construct a B-tree over S. The string S is initially partitioned into substrings that are stored in superblocks. We number the superblocks from left to right, so that the i-th superblock stores the i-th substring. If a superblock has fewer than L bits we call it skinny. There must not exist two consecutive skinny superblocks (this way we bound the number of leaves), which gives us an upper bound of O(n log σ / L) leaves. Note that a bit vector of length n can be represented using nH0 + o(n) + O(w) bits while still supporting all the operations in O(log n / log log n) time. He et al. chose b = √(log n). They additionally required that the degree of each internal node is between b and 2b. For each internal node v we need to store the following data structures (let h be the number of children of v): a sequence U(v)[1..h], in which U(v)[i] is the number of superblocks contained in the leaves of the subtree rooted at the i-th child of v; and a sequence I(v)[1..h], in which I(v)[i] stores the number of characters stored in the leaves of the subtree rooted at the i-th child of v. Finally, for each character c, we construct an integer sequence E_c[1..t] in which E_c[i] stores the number of occurrences of character c in superblock i. We create σ sequences this way, and we construct a CSPSI structure, E, for them. Following the approach of Ferragina et al. in [1], He et al. managed to get the following result:
THEOREM 3. Under the word RAM model with word size w = Ω(log n), a string S of length n over an alphabet of size σ = O(√(log n)) can be represented using nH0(S) + O(n log σ log log n / log n) + O(w) bits to support access, rank, select, insert and delete in O(log n / log log n) time.

4. DYNAMIZING SUCCINCT DATA STRUCTURES
Gupta et al. in [5] presented a framework to dynamize succinct data structures. With their framework it is possible to dynamize most succinct data structures for dictionaries, trees, and text collections. We will now present the idea behind this framework and how it compares with the other data structures. The construction of their data structure is based on some static data structures which we will only list:
• For a bitvector (i.e.
|Σ| = 2) of length n, there exists a static structure D called RRR solving the bit dictionary problem supporting rank, select and access queries in O(1) time using nH0 + O(n log log n / log n) bits of space, while taking O(n) time to construct [10].
• For a text S of length n drawn from an alphabet Σ, there exists a static data structure D called the wavelet tree solving the text dictionary problem supporting rank, select and access queries in O(log |Σ|) time, using nH0 + o(n log |Σ|) bits of space, while taking O(nH0) time to construct ([4]).
• For a text S of length n drawn from an alphabet Σ, there exists a static data structure D called GMR that solves the text dictionary problem supporting select queries in O(1) time and rank and access queries in O(log log |Σ|) time, using n log |Σ| + o(n log |Σ|) bits of space, while taking O(n log n) time to construct ([2]).
• Let A[1..t] be a nonnegative integer array such that Σ_i A[i] ≤ n. There exists a data structure PS (prefix-sum) on A that supports sum and findsum in O(log log n) time using O(t log n) bits of space and can be constructed in O(t) time.
• A Weight Balanced B-tree (WBB) is a B-tree defined with a weight-balance condition. This means that for any node v at level i, the number of leaves in v's subtree is between (1/2)b^i + 1 and 2b^i - 1, where b is the fanout factor. Insertion and deletion can be performed in amortized O(log_b n) time while maintaining the condition ([10, 7, 5]).
Gupta's solution is based on three main data structures:
• BitIndel: a bitvector supporting insertion and deletion;
• StaticRankSelect: a static text dictionary structure supporting rank, select and access on a text T;
• onlyX: a non-succinct dynamic text dictionary.
StaticRankSelect is used to maintain the original text T (we can use the wavelet tree or the GMR data structure). We need to upgrade it so that it can support insert_x(i) (insert a symbol x ∈ Σ at position S[i]) and delete(i) (delete the symbol at position S[i]); we will call this new structure inX. We use onlyX to keep track of newly inserted symbols (N). We merge the updates into StaticRankSelect every O(n^{1-ε} log n) update operations. OnlyX cannot contain more than O(n^{1-ε} log n) elements, so we can maintain it using O(n^{1-ε} log^2 n) = o(n) bits. Merging N with T takes O(n log n) time, so we get an amortized O(n^ε) time for updating these data structures.

Table 1: A comparison of the results [6]
                      space (bits)                    access, rank and select                        insert and delete
Gonzalez et al. [3]   nH0 + o(n) log σ                O(log n (log σ / log log n + 1))               O(log n (log σ / log log n + 1))
He et al. [6]         nH0 + o(n) log σ                O((log n / log log n)(log σ / log log n + 1))  O((log n / log log n)(log σ / log log n + 1))
Gupta et al. [5]      n log σ + log σ (o(n) + O(1))   O(log log n)                                   O(n^ε) amortized

The final data structure is comprised of two main parts: inX and onlyX. After every O(n^{1-ε} log n) updates onlyX is merged into the original text S. The cost can be amortized to O(n^ε) per update. The StaticRankSelect structure on S takes n log |Σ| + o(n log |Σ|) bits of space. The other structures take o(n) bits of space ([5]). We need to augment the above two structures with a few additional BitIndel structures. For each symbol c, we maintain a bitvector I_c such that I_c[i] = 1 if and only if the i-th occurrence of c is stored in the onlyX structure. We now describe how to support the rank and select operations with the above structures.
rank_c(i): We first find j = inX.rank_c(i) and k = inX.rank_x(i) and return j + onlyX.rank_c(k).
select_c(i): We first find whether the i-th occurrence of c belongs to the inX structure or the onlyX structure.
If I_c[i] = 0, this means that the i-th item is one of the original symbols from S; we query inX.select_c(j) in this case, where j = I_c.rank_0(i). Otherwise we compute j = I_c.rank_1(i) to translate i to its corresponding position among the new symbols. Then we compute j = onlyX.select_c(j), and return inX.select_x(j). Gupta et al. in [5] showed how to maintain the I_c bitvectors during updates, analysed the space complexity of the BitIndel data structures, and proved the following theorem:
THEOREM 4. Given a text S of length n drawn from an alphabet Σ, we can create a data structure using GMR that takes n log |Σ| + o(n log |Σ|) + o(n) bits of space and supports rank, select and access in O(log log n + log log |Σ|) time and insert and delete updates in O(n^ε) time.

5. CONCLUSION
We presented some recent research done in the field of succinct data structures for texts with support for update operations. Gonzalez et al. were the first to manage to get the operations in O(log n (1 + log σ / log log n)) time with only nH0 + o(n) log σ bits. Soon after, He et al. improved their result with a slight modification of this algorithm, by replacing the red-black tree used by Gonzalez et al. with a B-tree. They cut the worst-case time complexity by a log log n factor. Gupta et al. took a totally different direction and managed to get a better query time complexity, but sacrificed update times. Their solution is designed under the assumption that the string is queried frequently but updated infrequently. Research on succinct data structures has gained a lot of momentum in the last few years [9] and has produced surprising results. The most successful indexes are able to obtain almost optimal space and search time.

6. REFERENCES
[1] P. Ferragina, G. Manzini, and V. Mäkinen. Compressed representations of sequences and full-text indexes. ACM Transactions on ..., 2007.
[2] A. Golynski, J. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, pages 368-373, 2006.
[3] R. Gonzalez and G. Navarro. Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science, 2009.
[4] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. ... of the fourteenth annual ACM-SIAM ..., 2003.
[5] A. Gupta, W. Hon, R. Shah, and J. Vitter. A framework for dynamizing succinct data structures (2008). works.bepress.com.
[6] M. He and J. I. Munro. Succinct representations of dynamic strings. arXiv, cs.DS, May 2010.
[7] W. Hon, K. Sadakane, and W. Sung. Succinct data structures for searchable partial sums. Algorithms and Computation, pages 505-516, 2003.
[8] G. Jacobson. Space-efficient static trees and graphs. 30th Annual Symposium on Foundations of Computer Science, pages 549-554, 1989.
[9] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 2007.
[10] R. Raman, V. Raman, and S. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG), 3(4):43, 2007.

Propagation Algorithms for Protein Classification
Andras Bota
Institute of Informatics, University of Szeged, Hungary
bandras@inf.u-szeged.hu

ABSTRACT
In this paper we propose a simple propagation algorithm for protein classification. It is similar to existing label propagation algorithms, with some important differences.
The method is general, so it can be used for other purposes as well, although in this paper it is used solely for solving a binary classification problem, and its results are evaluated using ROC analysis. It performs in reasonable time and produces fairly good results. We have used various databases for testing purposes, including SCOP40 and 3PGK.

General Terms: Classification
Keywords: label propagation, protein classification, binary classification, ROC analysis
Supervisor: Miklos Kresz (University of Szeged, Department of Applied Informatics, Hungary; contact: kresz@jgypk.u-szeged.hu)

1. INTRODUCTION
The categorization of genes and proteins is an important field of research, and a variety of methods and algorithms have been tried to solve this problem, ranging from simple pairwise comparison to the more advanced methods of artificial intelligence, like support vector machines or hidden Markov models. A database of protein sequences can be effectively represented as a network, in which edges represent some sort of similarity. This similarity network is known a priori and forms the basis of the classification process. On the other hand, similarity networks are usually large, so only fast algorithms are able to tackle the problem. Label propagation algorithms originate from graph theory, and have the advantage of using the entire network, represented as a similarity matrix, for the process of classification. Propagation algorithms are on average faster than their more advanced counterparts, and can produce reliable results. There are various propagation methods currently used in the field of bioinformatics [1, 2, 5, 10]. Our method is in many ways similar to these methods, but far simpler than them, while still producing good results.

2. PROTEIN CLASSIFICATION METHODS AND EVALUATION
In this chapter we will give a short introduction to protein classification, as well as to ROC analysis, which is a widely accepted measurement method in bioinformatics. We will also discuss a few propagation methods.

2.1 Protein classification
There are various similarity measures that can be applied to proteins. Databases like COG are based on functional similarities, while other databases (like SCOP40) are based on structural similarity. More precisely, the similarity is based on the three-dimensional structure and amino acid sequences of the proteins [10]. Two protein sequences are similar if they contain subsequences that share more similar amino acids than would be expected to occur by chance. Sequence comparison algorithms can be based on exhaustive search, like the Smith-Waterman algorithm, but heuristic algorithms, like BLAST, are also used frequently. Other forms of similarity measurement include structural comparison methods (like PRIDE or DALI). For details about the above methods see [3]. The measure computed by these methods is a true similarity measure, meaning that for very similar proteins they produce a high score, and for very different proteins they produce a value close to zero. It is worth mentioning that some algorithms produce a distance measure instead, which means that similar elements have a score close to zero, and different elements have a greater score. Protein classification is traditionally considered to be a binary classification problem [2], although the one-class classification approach is also very popular [1].
In this paper, we stick to the traditional approach, meaning we consider the training dataset as a set of both positive and negative examples, and we expect the algorithm to produce the same classification. The SCOP40mini database consists of 1357 protein domain sequences, which belong to 55 superfamilies [3]. The sequences were divided into training and test datasets. Members of the test datasets were chosen so that they were not represented in the training set, meaning there was a low degree of sequence similarity and no guaranteed evolutionary relationship between the two sets. The SCOP40mini database contains nine different similarity measurements, based on sequence and structural comparison methods [3].

2.2 ROC Analysis
Measuring the goodness of the results of an algorithm is also an important issue. One of the most widely accepted measurement methods in bioinformatics is receiver operating characteristic (ROC) analysis. ROC analysis provides both a visual and a numerical summary of the classifier's results [9]. Originally developed for radar applications in World War II, it became widely used first in medical applications, later in machine learning and data mining. Receiver operating characteristic analysis is generally used for evaluating the results of a binary classifier. A binary classifier assigns an element to one of two classes, usually denoted positive or negative (+ or -). The classification process uses two distinct datasets, the train and the test dataset. In machine learning, the training dataset is used to train the algorithm; in other methods it is used as the basis of calculating the results. Classification methods can be divided further into two distinct categories. Discrete classifiers predict only the classes to which the test elements belong. This means that the classifier can produce 4 different results: true positive, true negative, false positive and false negative. From these, a contingency table can be constructed, and various measurements can be computed. Of these, two important ones should be mentioned:
TPR (true positive rate) = TP / (TP + FN),
FPR (false positive rate) = FP / (FP + TN),
where TP, TN, FP, FN denote the four cases mentioned above. The other type of classifier is the probabilistic classifier. This assigns a value between 0 and 1 to each member of the test set, which can be viewed as a degree of similarity to the distinct classes. This score is then used to create a ranking of the elements, and the classifier is good if the positive elements are at the top of the list. More precisely, if we apply a threshold value to this ranking we can create a discrete classifier. By applying multiple threshold values, we can create an infinite number of classifiers. Using these threshold values we can create a ROC curve by counting the true and false positive rates for each threshold value. An example can be seen in Figure 1. Each point of the curve corresponds to a threshold value, which in turn corresponds to a discrete classifier. A ROC curve can be evaluated both numerically and graphically. The ideal classifier creates a ROC curve that is identical to the unit step function. A classifier which assigns test elements to the classes at random generates a ROC curve that is identical to the unit ramp function. In general, the higher the curve, the better the classification. It must be mentioned that the curve is only defined between 0 and 1 on each axis. This is of course derived from the fact that both TPR and FPR are values between 0 and 1.

Figure 1: ROC curve
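To make the construction concrete, here is a small self-contained sketch of turning a score ranking into ROC points by sweeping thresholds, together with the area under the resulting curve (discussed next in the text). It is plain Python; the scores and labels are made-up toy data, and real evaluations would use many more elements:

```python
def roc_points(scores, labels):
    """Sweep a threshold over the ranking and record (FPR, TPR) pairs.
    labels[i] is True for positive elements; scores[i] is the classifier's
    value for element i (higher means more positive)."""
    P = sum(labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.2]
labels = [True, True, False, True, False, False]
print(round(auc(roc_points(scores, labels)), 3))   # 0.889 for this toy ranking
```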
By computing the integral of the curve we get the "area under the curve" value, or AUC. The AUC value of an ideal classifier is 1; the random classifier has a value of 0.5. The AUC value can be interpreted as the probability that a randomly chosen positive element is ranked higher than a randomly chosen negative element. In this paper the AUC value by itself is used to measure the goodness of the results, although it is worth mentioning that a high AUC value does not guarantee that the positive elements are ranked at the top of the list. This topic is discussed further in [8, 9].

2.3 Propagation algorithms
Originating from graph theory, propagation algorithms require a network containing vertices and edges as the basis of the method. Perhaps the best-known propagation method is the PageRank algorithm [6], first used by Google for information retrieval tasks. Note that Kleinberg had proposed a very similar algorithm for finding hubs and authorities earlier [4]. There are various ranking algorithms specifically for protein classification, namely [1, 2, 5, 10]. In general, these methods use a query mechanism to generate results. With the given similarity network, we add a new element as a query, and the classification algorithm calculates the class (or the probability) this element belongs to. In comparison, most algorithms from graph theory [7] use the entire network for propagation. In label propagation, some nodes of the network have initial labels, some do not, and the purpose of the algorithm is to assign labels to nodes that did not have labels previously. The method described in this paper uses the latter approach, and aims to solve the classification task without the need to cut the edges of the similarity network.

3. A SIMPLE PROPAGATION ALGORITHM
The method we propose in this paper is a simple general propagation algorithm for function approximation, closely based on the work of [7]. The method is general in the sense that it was not created solely for the purpose of protein classification, although in this paper we will stick to this example of its application.

3.1 The label propagation algorithm by Raghavan et al. [7]
Originally a community detection algorithm, the label propagation method of Raghavan et al. is based on a simple but effective idea. In a graph (directed or not) we assign labels to each node. In one iteration step, each node x adopts the label that a maximum number of its neighbors x1, x2, ..., xk share. When a tie happens, a random selection is used. At the beginning, each node is initialized with a different label, and as labels propagate through the network, sooner or later a consensus emerges among the densely connected subgraphs of the original graph. Communities are formed this way. It is also easy to generalize this method to weighted graphs. An example of this process can be seen on Zachary's karate club network [11] in Figure 2. The three different shades of grey represent the different labels.

Figure 2: Propagation example

The labels can be updated synchronously or asynchronously. In the first case the labels assigned at iteration number t depend only on the labels of iteration t - 1: Lx(t) = f(Lx1(t - 1), Lx2(t - 1), ..., Lxk(t - 1)), where Lx(t) denotes the label of node x at iteration t, and f is the update function. If there are subgraphs in the network that are bipartite or nearly bipartite in structure, this method might oscillate. The asynchronous update method solves this problem: we consider a node in iteration t.
There are some neighbors that have already updated their labels in iteration t: x1, x2, ..., xm, and some that have not; these have the labels from iteration t - 1: xm+1, xm+2, ..., xk. We update the label of the node based on the labels of the current iteration in the case of neighbors that have already updated their labels. For neighbors not yet updated, we use the labels of iteration t - 1: Lx(t) = f(Lx1(t), Lx2(t), ..., Lxm(t), Lxm+1(t - 1), Lxm+2(t - 1), ..., Lxk(t - 1)). In the synchronous case the order in which we choose the nodes for updating is irrelevant. In the asynchronous case the order can be important. In the original method a random approach is used. This algorithm can also be called the coloring algorithm, because the labels can be viewed as colors, and in the end, each community has a different color. The colors or labels can also be viewed as categories; thus if we have only two labels, we have two categories, and this corresponds to a binary classification problem.

3.2 Application for binary classification
As we have mentioned before, in binary classification the two classes are usually called positive and negative, and there are two distinct datasets used by the algorithm, the train dataset and the test dataset. A similarity matrix is constructed (according to some comparison method) between the elements of both datasets. The similarity values are available for the whole dataset, even between the test and train datasets: s(x, y) : x, y ∈ (Train ∪ Test). If we consider the dataset elements as nodes of a graph, this similarity value defines an edge weight, creating a weighted graph containing the elements of both the train and test datasets. With a small modification, the algorithm described above can be applied to the protein classification problem. In the original algorithm each node receives an initial label different from all other labels. This is of course not correct for the purpose of binary classification. The labels are only known a priori for the train dataset, so only these nodes get an initial label, and as the algorithm terminates, the unlabeled nodes will get labels as well. If we apply this algorithm directly to the problem described above, we get very poor results. This is not surprising, since this method (and most community detection methods) are not designed for complete graphs like the similarity networks of this problem. One obvious way to solve this problem is to convert the graph into an unweighted form by cutting the edges according to some weight limit, but this leads to the equally obvious problem of choosing the appropriate value. With luck, or after a significant number of trial-and-error steps, it is possible to get good results. Still, this is not a feasible solution to the problem.

3.3 The AProp method
The method proposed in this paper uses a different approach. Rather than using discrete labels for the algorithm, we work with continuous ones. Theoretically there are no restrictions on the labels we can use, although in our approach we have used real numbers to represent the labels. If we choose two distinct labels for the original classes (like 0 and 1), the algorithm will assign values between these to the uninitialized nodes. It is important to point out that these values should not be thought of as probabilities. They represent some "closeness" to one class or the other. These values can then be evaluated with ROC analysis to produce a goodness measurement. The update rules are only shown for the synchronous method. There are two reasons for this.
In the original algorithm, the asynchronous update rule was introduced to solve convergence issues. These issues occurred because nearly bipartite subgraphs caused fluctuations of the discrete labels. Using continuous labels solves this problem, thus the asynchronous rules are not needed. In spite of this, just for the sake of curiosity, we have implemented the asynchronous version as well, but testing showed that it does not perform as well as the synchronous method.
The labels are updated according to the formula described in the previous section, the only difference being that the update function f changes to a simple summation. The label of node x changes based on the labels of its neighbors N(x):
Lx(t) = (1/d) Σ_{y ∈ N(x)} Ly(t - 1),
where d denotes the degree of x. It is easy to extend this rule to weighted graphs. Let dw denote the weighted degree of x:
Lx(t) = (1/dw) Σ_{y ∈ N(x)} s(x, y) Ly(t - 1),
where s(x, y) is the similarity value between nodes x and y mentioned above. It is worth noting that the similarity value is symmetric, s(x, y) = s(y, x), so we have designed the algorithm to work with undirected graphs. The initialized labels never change, so they should be left out of the update process. This also reduces the time complexity significantly. The algorithm consists of iteration steps, the number of iterations being a possible parameter. Another way to stop the algorithm would be to count the changes that happen in an iteration, and when this change is below a given minimum, the algorithm halts. Our experience shows that no significant change occurs after 6 to 8 iterations.
The label propagation algorithm of Raghavan et al. begins with initializing each node with a different label. We have dropped this approach because we wanted to use the algorithm for binary classification, but since we do not know the labels of the members of the test database a priori, some nodes will be left uninitialized. This leads us to the problem of representing the uninitialized status of nodes. There are two ways of solving the problem. The first way is the "blind" approach. In this case, we do not represent the uninitialized status at all. These nodes can be selected for updating labels, but otherwise they are simply left out of the summation. To do this, first we have to sort out the unlabeled nodes from the neighborhood sets. Also, when we count the weighted degree, we should only consider edges that are incident to labeled nodes. This somewhat changes the update rule:
Lx(t) = (1/ds) Σ_{y ∈ Ns(x)} s(x, y) Ly(t - 1),
where Ns(x) denotes the sorted neighborhood, and ds denotes the sorted degree. The second method is to assign a value to the uninitialized status. This way, we do not have to change the algorithm at all. But this also leaves us with the question: what value should be used for the uninitialized nodes? This solution is called the "biased" approach. The value of the uninitialized nodes should be chosen according to the two class labels. Depending on these considerations, the algorithm can be further divided into three categories:
Neutral biased. The value is neutral towards the class labels. For example, if the class labels are 0 and 1, the value is 0.5.
Directly biased. The value is one of the class labels. For example, if the class labels are 0 and 1, the value is 0 or 1.
Indirectly biased. Any value that is not one of the class labels or the neutral value. In the case of 0 and 1, any value other than 0, 1 and 0.5.
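A minimal sketch of the synchronous update described above, covering both the "blind" handling of uninitialized nodes and the "biased" variants via a starting value (plain Python; the similarity matrix, the seed labels and the parameter names are made-up for the illustration):

```python
def aprop(sim, seed_labels, iterations=7, uninit=None):
    """Synchronous continuous label propagation.
    sim[x][y]      -- symmetric similarity between nodes x and y
    seed_labels[x] -- 0/1 class label for training nodes, never changed
    uninit         -- starting value for test nodes (the "biased" variants);
                      None means the "blind" approach: unlabeled neighbours
                      are simply left out of the summation."""
    n = len(sim)
    labels = dict(seed_labels)
    if uninit is not None:
        labels.update({x: uninit for x in range(n) if x not in labels})
    for _ in range(iterations):
        new = dict(labels)
        for x in range(n):
            if x in seed_labels:          # initialized labels never change
                continue
            nbrs = [y for y in range(n) if y != x and y in labels]
            d = sum(sim[x][y] for y in nbrs)
            if d > 0:
                new[x] = sum(sim[x][y] * labels[y] for y in nbrs) / d
        labels = new
    return labels

# Toy 4-node complete similarity graph: nodes 0 (class 0) and 1 (class 1)
# form the training set, nodes 2 and 3 are test elements.
sim = [[0.0, 0.2, 0.9, 0.1],
       [0.2, 0.0, 0.1, 0.8],
       [0.9, 0.1, 0.0, 0.3],
       [0.1, 0.8, 0.3, 0.0]]
print(aprop(sim, {0: 0.0, 1: 1.0}))
```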
In the last two cases, further distinction can be made depending on which class label we choose, or, in the third case, which class label is closer to the uninitialized value. In the end, there are four (or six) variations of the algorithm. Perhaps it is not a big surprise that these variations perform slightly differently when applied to the protein classification problem.

4. RESULTS
Our method was tested on the SCOP40mini and 3PGK databases [3]. The SCOP40mini database consists of 55 classification tasks, and there are seven similarity (or distance) networks available for it. The 3PGK database consists of ten experiments and four similarity (or distance) networks. Further details of the networks can be found in [3]. In Tables 2 to 5,
• BLAST stands for the basic local alignment search tool.
• SW stands for the Smith-Waterman algorithm.
• NW stands for the Needleman-Wunsch method.
• LA stands for the local alignment kernel method.
• PRIDE stands for the PRIDE structural similarity.
• LZW stands for the Lempel-Ziv-Welch method.
• DALI stands for DaliLite structural alignment.
• PsiBLAST stands for Position-Specific Iterated BLAST.
We have also compared our results with other algorithms provided as a benchmark. Further details of these can be found in [3]. In Tables 2 to 5,
• 1nn stands for the one nearest neighbor method.
• RF stands for random forest classification.
• SVM stands for a support vector machine.
• ANN stands for an adaptive neural network.
• LogReg stands for logistic regression.
• AProp stands for approximate propagation.

4.1 Iteration number
For determining the optimal iteration number, we have tested the algorithm on the Smith-Waterman similarity matrix of the 3PGK database. In Table 1, the AUC values are displayed for the individual experiments as well as the average value. It can be seen that the values do not change much after iteration 7, so this was the value we used for the rest of the evaluation. It is also worth noting that the performance does not increase monotonically before iteration 7, but fluctuates somewhat.

Table 1: AUC values by iteration number (3PGK, Smith-Waterman matrix)
Iterations  1       2       3       4       5       6       7       8       10      16
Ex1         1       1       1       1       1       1       1       1       1       1
Ex2         1       1       1       1       1       1       1       1       1       1
Ex3         0.9548  1       1       1       1       1       1       1       1       1
Ex4         0.7391  0.8009  0.8571  0.893   0.9207  0.9253  0.9317  0.9336  0.9345  0.9354
Ex5         0.8827  0.9215  0.9279  0.9236  0.9204  0.9118  0.9107  0.9107  0.9096  0.9096
Ex6         1       1       1       1       1       1       1       1       1       1
Ex7         0.709   0.7909  0.8227  0.8363  0.85    0.85    0.85    0.85    0.85    0.85
Ex8         0.9727  0.9886  0.9931  0.9931  0.9931  0.9931  0.9931  0.9931  0.9931  0.9931
Ex9         1       1       1       1       1       1       1       1       1       1
Ex10        0.7613  0.7982  0.8153  0.8295  0.8352  0.838   0.838   0.838   0.838   0.838
Avg         0.9019  0.93    0.9416  0.9475  0.9519  0.9518  0.9523  0.9525  0.9525  0.9526

4.2 3PGK
The results for all variations of the algorithm can be seen in Tables 2 and 3, as well as a comparison against other algorithms. In the first column, the algorithm variations can be seen.
• nb denotes neutral biased.
• dbn denotes directly biased towards negative.
• dbp stands for directly biased towards positive.
• idbn stands for indirectly biased towards negative.
• idbp denotes indirectly biased towards positive.

Table 2: Results for 3PGK
Method  BLAST   SW      LA
blind   0.9241  0.9470  0.9484
nb      0.9234  0.9412  0.9481
dbn     0.9336  0.9518  0.9473
dbp     0.8973  0.9029  0.9483
idbn    0.9267  0.9317  0.9444
idbp    0.8898  0.9048  0.9485

Table 3: Results for 3PGK
Method  BLAST   SW      LA
1nn     0.8633  0.8605  0.8596
RF      0.8517  0.8659  0.8755
SVM     0.9533  0.9527  0.9549
ANN     0.9584  0.9548  0.9564
LogReg  0.9537  0.9476  0.9593
AProp   0.9336  0.9518  0.9485
If we consider only the variations of the approximation algorithm, we can see that none of the approaches appears to be dominant. It can also be seen that our method performs almost as well as the best approaches.

4.3 SCOP40mini
The results for all variations of the algorithm can be seen in Tables 4 and 5, as well as a comparison against other algorithms. Unlike in the previous case, the idbp approach outperforms all other variations. We can only guess the reason for this behaviour. It is possible that the experiments in this dataset are structured in a way that initializing the test dataset towards the positive class label greatly enhances performance. In the implementation, the positive class was labeled 1 and the test dataset was initialized with the value of 2.

Table 4: Results for SCOP40mini
Method  BLAST   SW      NW      LA      PRIDE   LZW     PsiBLAST  DALI
blind   0.899   0.9476  0.9337  0.9442  0.8891  0.8261  0.9083    0.9974
nb      0.9245  0.9492  0.9353  0.9455  0.8897  0.8267  0.9239    0.995
dbn     0.8909  0.9475  0.9335  0.944   0.889   0.826   0.9065    0.9974
dbp     0.932   0.9509  0.9372  0.9466  0.8903  0.8273  0.9304    0.9977
idbn    0.66    0.9434  0.929   0.9412  0.8876  0.8247  0.7792    0.994
idbp    0.9382  0.9536  0.9405  0.9491  0.8914  0.8283  0.9363    0.998

Table 5: Results for SCOP40mini
Method  BLAST   SW      NW      LA      PRIDE   LZW     DALI
1nn     0.7577  0.8154  0.8252  0.7343  0.8644  0.7174  0.9892
RF      0.6965  0.8230  0.8030  0.8344  0.9105  0.7396  0.9941
SVM     0.904   0.9419  0.9376  0.9396  0.9361  0.8288  0.9946
ANN     0.7988  0.8875  0.8834  0.9022  -       0.8346  -
LogReg  0.8715  0.9063  0.9175  0.8766  0.9029  0.7487  0.9636
AProp   0.9382  0.9536  0.9405  0.9491  0.8914  0.8283  0.998

It can be seen that our method performs better than the other algorithms in most cases. The running time of the algorithm is very low. The database consists of 55 classification tasks. Solving all of these tasks took 15 minutes on an average notebook computer. This means that around 16 seconds is needed for the evaluation of one task.

5. CONCLUSION AND FUTURE WORKS
It can be seen from the results that our algorithm generally performs better on larger databases, with the running time being sufficiently low. The method is relatively general in the sense that it can be used as an approximation algorithm, and it can be extended in several ways described below. The method is very sensitive to the starting labels of the nodes. This can be a weakness, but a proper learning algorithm can exploit this property and improve the overall performance of the algorithm. Making the method more general can be difficult. The labeling approach is designed to handle binary classification tasks. One way to extend this would be to use multidimensional labels. Using one number to describe a label corresponds to a one-dimensional coordinate. Extending the number of dimensions of the labels would result in multiple classes, while each coordinate would indicate some sort of membership function towards one class or the other.

Acknowledgment
The author was supported by the project "TAMOP-4.2.1/B-09/1/KONV-2010-0005 - Creating the Center of Excellence at the University of Szeged", also supported by the European Union and co-financed by the European Regional Fund.

6. REFERENCES
[1] A. Banhalmi, R. Busa-Fekete, and B. Kegl. A one-class classification approach for protein sequences and structures. In Proceedings of the 5th International Symposium on Bioinformatics Research and Applications, ISBRA '09, pages 310-322, Berlin, Heidelberg, 2009. Springer-Verlag.
[2] R. Busa-Fekete, A. Kocsor, and S. Pongor. Tree-based algorithms for protein classification.
In Computational Intelligence in Bioinformatics, pages 165-182. 2008.
[3] Protein benchmark database. http://net.icgeb.org/benchmark/.
[4] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46:604-632, September 1999.
[5] A. Kocsor, R. Busa-Fekete, and S. Pongor. Protein classification based on propagation of unrooted binary trees. Protein and Peptide Letters, 15(5):428-437, June 2008.
[6] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
[7] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76(3):036106, Sep 2007.
[8] R. Busa-Fekete, A. Kertesz-Farkas, A. Kocsor, and S. Pongor. Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities. Journal of Biochemical and Biophysical Methods, 70(6):1210-1214, 2008.
[9] P. Sonego, A. Kocsor, and S. Pongor. ROC analysis: applications to the classification of biological sequences and 3D structures. Briefings in Bioinformatics, 9(3):198-209, January 2008.
[10] J. Weston, A. Elisseeff, D. Zhou, C. Leslie, and W. Noble. Protein ranking: from local to global structure in the protein similarity network. PNAS USA, 101:6559-6563, 2004.
[11] W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452-473, 1977.

Optimal and Reliable Covering of Planar Objects with Circles
Endre Palatinus
Institute of Informatics, University of Szeged
2 Arpad ter, Szeged, Hungary H-6720
palatinuse@gmail.com

ABSTRACT
We developed an algorithm for finding the sparsest covering of planar objects with a given number of circles with fixed centers. The covering is tested reliably using interval arithmetic. We applied our solution to a telecommunication-related problem.

Supervisor: Dr. Balazs Banhelyi, assistant professor, Institute of Informatics, University of Szeged

Categories and Subject Descriptors
G.1.6 [Numerical Analysis]: Optimization—Global optimization

1. INTRODUCTION
The circle packing problem has attracted much attention in the last century, and a variant called packing of equal circles in a square receives attention even nowadays [1]. Its objective is to give the densest packing of a given number of congruent circles with disjoint interiors in a unit square. However, its dual problem, circle covering, has not been studied as exhaustively so far. Optimal coverings of the unit square with congruent circles of minimal radii have been found [2] only for small numbers of circles, and their optimality has been proved mathematically. Computational methods have been developed [3] to find solutions for higher circle counts, but the produced results were neither computationally reliable, nor provable approximations of the global optimum.

Our aim is to develop a global optimization method for finding the sparsest reliable covering of a convex or concave planar object with a given number of circles with fixed centers, where the sizes of the circles' radii may differ. By sparsest we mean minimizing the sum of the circles' areas. We have chosen this fitness function because it can prove useful in many applications, such as covering a region with terrestrial radio signals and determining the optimal energy consumption of wireless sensor networks.
2. RELIABLE COMPUTING
The CPUs of today's personal computers have an arithmetic optimized for computer games. This means that the speed of the computations is more important than their accuracy. The floating point computations performed by these CPUs yield results in the domain of the numbers which can be represented in their arithmetic. Since every result is only an approximation of the true result, the error of a complex computation can be arbitrarily large. Consider for example the following expression:

$2^{100} + 2010 - 2^{100} = ?$

When calculating the result of this expression using double-precision floating point arithmetic, the result will be 0 instead of 2010. This is due to the limited precision of the arithmetic. Mathematicians do not accept computer-assisted proofs if they contain computations performed with unreliable arithmetic. Even the solutions of global optimizers can be doubted to be globally optimal if we consider that their computations contain accumulated rounding errors.

Interval arithmetic is one solution for handling the uncertainty of computations. It represents the results of computations as intervals which are guaranteed to contain the precise result. This way it incorporates the rounding errors, too, but does not hide them as floating point arithmetic does.

3. THE CIRCLE COVERING PROBLEM
Let $(r_1, r_2, \dots, r_n)$ denote the radii of the circles in our constrained optimization problem. The fitness function is therefore:

$f(r_1, r_2, \dots, r_n) = \sum_{i=1}^{n} r_i^2 \to \min$

To solve this problem we need:
• a reliable method for testing whether a given arrangement is a covering, and
• an optimizer for finding the optimal one.

Checking whether a given arrangement is a covering can be done by checking whether every point of the planar object is covered by any of the circles. This is where interval arithmetic proves to be quite useful, because applying it we can check whether a whole rectangle is covered by any of the circles, simply by replacing points with intervals. An additional benefit of using it is that our algorithms become reliable by avoiding uncertain computations. In our implementation we used C-XSC [4], a widely recognized C++ class library for extended scientific computing, which provided the interval arithmetic extension of all the mathematical operations. We can define the covering property as follows. Every circle is an open set in the plane. The desired property for every point $p$ of the planar object and some circle $(\mathrm{center}, r)$ is:

$\sup((p - \mathrm{center})^2) < \inf(r^2)$

It is straightforward to use a branch-and-bound (B&B) based method for checking the covering. We have chosen a parallelized B&B method introduced in [5], since there is a great potential in traversing the different branches of the B&B search tree concurrently on different CPU cores. We can even check general properties of multidimensional intervals, as the following statements say. If we have a method capable of proving a property P of a point and its neighborhood, then the B&B-based testing method is able to prove this property P for a multidimensional interval. In case the interval has the given property we have to traverse the whole B&B search tree, and the subtrees can be traversed concurrently by different CPU cores. If there is no lower bound on the width of the subintervals created, then in the positive case the algorithm terminates in a finite number of steps with a positive answer [6]. In practice there is always a lower bound (ε), and if the width of a subinterval reaches this bound and it does not have the property, then we cannot decide the problem and return a negative answer.
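As a concrete illustration of these two points - the floating-point pitfall and the interval-based covering test - here is a minimal sketch in Python rather than C-XSC; the outward rounding via nextafter is a simplification of what a real interval library does, and the helper names are hypothetical.

```python
# Not the paper's C-XSC code: a toy interval type with outward rounding, used to
# verify sup((p - center)^2) < inf(r^2) for a whole rectangle of points at once.
import math

print(2.0**100 + 2010 - 2.0**100)         # prints 0.0 instead of 2010

class Interval:
    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, (hi if hi is not None else lo)

    @staticmethod
    def _out(lo, hi):                      # round outwards after every operation
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, o):
        return Interval._out(self.lo + o.lo, self.hi + o.hi)

    def __sub__(self, o):
        return Interval._out(self.lo - o.hi, self.hi - o.lo)

    def sqr(self):
        cands = (self.lo * self.lo, self.hi * self.hi)
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(cands)
        return Interval._out(lo, max(cands))

def box_covered_by_circle(bx, by, cx, cy, r):
    """True only if every point of the box bx x by is provably inside the circle."""
    d2 = (bx - Interval(cx)).sqr() + (by - Interval(cy)).sqr()
    return d2.hi < Interval(r).sqr().lo    # sup((p - center)^2) < inf(r^2)

print(box_covered_by_circle(Interval(0.0, 0.1), Interval(0.0, 0.1), 0.25, 0.25, 0.5))
```

In the full algorithm, a box for which this test fails is subdivided and its parts are checked recursively along the B&B traversal described above, so a negative answer is only returned once the subdivision limit ε is reached.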
To demonstrate the running of the testing method, we use some simple test cases: $n^2$ circles on a regular $n \times n$ grid. The results can be seen in Figure 1. As our experiments have shown, we gained an increase in performance comparable to the number of CPU cores [7] on these simple test cases, as shown in Figure 2. We can also state, according to Figure 3 - which is the plot of the runtime divided by $n^4$ versus $n$ - that the runtime of the algorithm is $O(n^4)$, that is $O(m^2)$ if $m$ denotes the total number of circles, which is equal to $n^2$.

Figure 1: The running of the testing method ((a) n = 3, (b) n = 4).
Figure 2: The efficiency of the testing method.
Figure 3: The runtimes of the testing method.

4. THE MATHEMATICAL MODEL OF THE PROBLEM
Now that we have introduced a reliable method for testing whether a given arrangement is a covering, we can concentrate on our second problem: finding the optimal one. First we have to examine the mathematical model of the problem. We start with some simple definitions. We call $r = (r_1, r_2, \dots, r_n)$ a configuration, which is a vector holding the radii of the circles. A configuration is bad if the corresponding circles do not cover the planar object, and good otherwise. Let $C = ([\underline{c}_1, \overline{c}_1], \dots, [\underline{c}_n, \overline{c}_n])$ denote a set of configurations, where the $i$-th component of every configuration in the set falls into the $[\underline{c}_i, \overline{c}_i]$ interval. The two distinguished configurations of $C$ are $\underline{C} = (\underline{c}_1, \dots, \underline{c}_n)$ and $\overline{C} = (\overline{c}_1, \dots, \overline{c}_n)$.

Now we show some simple properties of these configuration sets, which will prove useful in constructing an optimizer for the radii sizes. If the distinguished configurations $\underline{C}$ and $\overline{C}$ of $C$ are both bad, then the set $C' = ([0, \overline{c}_1], \dots, [0, \overline{c}_n])$ cannot contain the optimal solution. If the distinguished configurations $\underline{C}$ and $\overline{C}$ of $C$ are both good, then the set of configurations that dominate $\underline{C}$ componentwise but lie outside $C$ cannot contain the optimal solution. It follows that only such a set $C$ may contain the optimal solution whose $\underline{C}$ configuration is bad and whose $\overline{C}$ configuration is good. These statements are depicted in two dimensions in Figure 4.

Figure 4: Properties of the configuration sets ((a) both bad, (b) both good, (c) $\underline{C}$ bad and $\overline{C}$ good).

5. DETERMINING THE OPTIMAL RADII SIZES
Finally, we can introduce our B&B-based global optimizer for the radii sizes.

Optimizer algorithm:
• Let Q be an empty minimum priority queue, whose elements $C_i$ have their priority determined by the value of the fitness function for the $\underline{C_i}$ configuration.
• Push the initial set of configurations $C_0 = ([0, c_1], \dots, [0, c_n])$ into Q, where $\overline{C_0}$ is a configuration with a suitably high fitness value.
• Loop:
  - Pop a set of configurations $C$ from Q.
  - Subdivide $C$ along its widest side into two equal parts, $C_1$ and $C_2$, in such a way that $\underline{C} = \underline{C_1}$ and $\overline{C} = \overline{C_2}$.
  - If $\overline{C_1}$ is a good configuration, then push $C_1$ into Q.
  - If $\underline{C_2}$ is a bad configuration, then push $C_2$ into Q.
• Until the width of $C$ is less than ε.
• Return $C$ as the solution.

Using the properties of the configuration sets it can be proved that the above algorithm finds an approximation of the global optimum with arbitrary precision.
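A compact sketch of this optimizer follows, under simplifying assumptions: is_covering is an abstract oracle standing in for the reliable interval-based test of Section 3, and the initial upper bounds c_i are assumed large enough that (c_1, ..., c_n) is a covering while the all-zero configuration is not.

```python
# A minimal best-first branch-and-bound over boxes of radii; illustrative only.
import heapq, itertools

def fitness(radii):
    return sum(r * r for r in radii)                    # sum of r_i^2 -> min

def optimize_radii(c, is_covering, eps=1e-3):
    tie = itertools.count()                             # tie-breaker for the heap
    lo, hi = [0.0] * len(c), list(c)                    # box C = ([0,c_1], ..., [0,c_n])
    heap = [(fitness(lo), next(tie), lo, hi)]
    while heap:
        _, _, lo, hi = heapq.heappop(heap)
        if max(h - l for l, h in zip(lo, hi)) < eps:
            return hi                                   # verified covering within eps
        i = max(range(len(lo)), key=lambda j: hi[j] - lo[j])
        mid = (lo[i] + hi[i]) / 2
        lo1, hi1 = lo, hi[:i] + [mid] + hi[i + 1:]      # C1 keeps the lower corner
        lo2, hi2 = lo[:i] + [mid] + lo[i + 1:], hi      # C2 keeps the upper corner
        if is_covering(hi1):                            # upper corner of C1 is good
            heapq.heappush(heap, (fitness(lo1), next(tie), lo1, hi1))
        if not is_covering(lo2):                        # lower corner of C2 is bad
            heapq.heappush(heap, (fitness(lo2), next(tie), lo2, hi2))
    return None
```

With the interval-based covering test plugged in as is_covering, the corner returned on termination is a rigorously verified covering whose total area is within a tolerance controlled by eps of the optimum over the initial box.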
6. APPLICATIONS
We can apply our results to the problems mentioned in the introduction. Covering Hungary with terrestrial radio signals is a problem where we have to cover a concave planar object with circles, since the signals cover a circular area around the broadcasting stations. The area covered by a station is linearly dependent on the power fed into it; therefore minimizing the power consumption of the whole network is equivalent to minimizing the sum of the areas of the circular regions covered by the stations, which happens to be the value of our fitness function. Some optimal solutions can be seen in Figure 5. Similar problems have been solved in [8], but with unreliable computations. Some optimal coverings of the unit square are also shown in Figure 6.

Figure 6: Optimal coverings of the unit square.

7. REFERENCES
[1] Szabo, P.G., M.Cs. Markot, T. Csendes, E. Specht, L.G. Casado, and I. Garcia: New Approaches to Circle Packing in a Square - With Program Codes. Springer, Berlin, 2007.
[2] Erich Friedman: Circles Covering Squares. http://www2.stetson.edu/~efriedma/circovsqu/, (2005).
[3] Nurmela, K.J. and P.R.J. Östergard: Covering a square with up to 30 equal circles. Helsinki University of Technology, Laboratory for Theoretical Computer Science Research Reports, 62 (2000).
[4] Hofschuster, Kramer, Wedner, Wiethoff: C-XSC 2.0 - A C++ Class Library for Extended Scientific Computing. Preprint BUGHW-WRSWT 2001/1, Universität Wuppertal, 2001.
[5] Casado, L.G., J.A. Martinez, I. Garcia, and E.M.T. Hendrix: Branch-and-Bound interval global optimization on shared memory multiprocessors. Optimization Methods and Software, 23 (2008), 689-701.
[6] Banhelyi, B., Csendes, T., and Garay, B.M.: A Verified Optimization Technique to Locate Chaotic Regions of Henon Systems. Journal of Global Optimization, 35 (2006), 145-160.
[7] Palatinus, E., Banhelyi, B.: Circle Covering and its Applications for Telecommunication Networks. Proceedings of the 8th International Conference on Applied Informatics, Eger, Hungary, (2010), Submitted.
[8] Das, G.K., S. Das, S.C. Nandy, and B.S. Sinha: Efficient algorithm for placing a given number of base stations to cover a convex region. J. Parallel Distrib. Comput. 66 (2006), 1353-1358.

TRUTH-TELLER-LIAR PUZZLES - A GENETIC ALGORITHM APPROACH (WITH TUNING AND STATISTICS)
Attila Tanyi
Debreceni Egyetem, Egyetem ter 1., Debrecen, Hungary
+36 52 512900/22792
tatil@bom.hu

ABSTRACT
In this paper, satisfiability-like problems, namely a family of truth-teller-liar puzzles, are considered. In our puzzles, every participant may say only one complex statement (a conjunction of atomic statements) about the type of the other participants. Truth-tellers can say only true atomic statements, while a liar must say at least one false atomic statement if he/she says anything. The problem is to find the solution, i.e., which participants are liars and which of them are truth-tellers. We present a solution algorithm based on a genetic algorithm approach, where several types of tuning can be implemented.

Categories and Subject Descriptors
I.2.m [Artificial Intelligence]: Miscellaneous - genetic algorithms.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Genetic Algorithm, Truth-Teller-Liar Puzzle, Tuning.

Supervisor: dr. Benedek Nagy.

1. INTRODUCTION
The truth-teller-liar puzzles are well known. Scientists also like puzzles, since they can be used effectively in teaching, in argumentation, etc. Puzzles are the topics of several books; they were analysed by, e.g., R. Smullyan [11,12,13]. Several variations of truth-teller-liar puzzles are characterized in [9].
There are Strong truth-tellers, all of whose (atomic) statements are true, while a person is a Weak truth-teller if there is a true atomic statement among his statements. Similarly, a Strong liar can say only false (atomic) statements, while a Weak liar has at least one false atomic statement among the statements he utters. A puzzle is categorized by the types of its truth-tellers and liars. SS-type puzzles are presented in detail in [10]. We deal with a special type of puzzle, called SW-type puzzles, having Strong truth-tellers and Weak liars. They are solved by a graph modification method in [4,5]. They are also related to Boolean programming [5,7]. The duality of SW and WS puzzles is presented in [6]. Non-monotonic reasoning related to WW-puzzles is presented in [8]. In this paper, we consider only SW puzzles. The hardness of satisfiability-like problems suggests trying new approaches. Here, a genetic algorithm is used to solve these problems. Fine-tuning of its various parameters and functions is also shown to increase performance. The implementation is written in Java.

2. THE TRUTH-TELLER-LIAR PUZZLES
In this paper we use only special truth-teller puzzles, where the participants have full knowledge of their types and they can say only sentences about these facts. There are N participants in the riddle, each of whom is either a liar or a truth-teller. Statements are given in the form 'X says: A1 and A2 and ... and Aj', where A1, A2, ..., Aj are sub-statements (or atomic statements) either of the form 'Y is a liar' or 'Y is a truth-teller', and X and Y are names of arbitrary participants. Each of the participants can tell at most one statement about the other ones. The total number of atomic statements is M.

Example: The participants are Alice, Bob and Charlie (N=3). There are 2 statements given, which consist of 3 atomic statements (M=3).
Alice says: Bob is a liar and Charlie is a truth-teller.
Charlie says: Alice is a liar.
The task is to find out which participants are liars and which of them are truth-tellers, based on the given statements.

3. THE GENETIC APPROACH
In this section we give some details of our genetic algorithm. We do not recall all the details of how a genetic algorithm works; they can be found in several textbooks on the topic, like [1,2,3]. For various terms used here we also refer to [1,2,3]. The chromosomes of the individuals are arrays of bits representing the statuses of the participants. For example (N = 7):

|L|L|T|L|T|T|T|   c = (0,0,1,0,1,1,1)

The fitness function comes from a biological motivation. It rates the solution candidates by assigning fitness values to them: the better a solution candidate is, the greater value it should have. Here, the fitness function shows how many statements are fulfilled by an assignment of statuses, $f: C \to \{0, 1, \dots, M\}$. A solution is optimal if the fitness function is equal to M, $f(c) = M$ (i.e. all the statements are fulfilled). The genetic algorithm works with PS individuals at the same time, basically performing a parallel search. In each step, it creates a new generation of individuals, keeping the best individual for the next generation and creating the rest by roulette wheel selection - parents are selected randomly, with a probability that grows with their fitness values - and by creating offspring using uniform crossover and mutation.
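A minimal sketch of this loop follows (illustrative Python, not the author's Java implementation); as a simplification, the fitness here counts fulfilled complex statements, and the parameter names PS and PM match the ones tuned below.

```python
import random

def fitness(chromosome, statements):
    # statements: list of (speaker, [(person, claimed_type), ...]) complex statements;
    # claimed_type is 1 for "... is a truth-teller" and 0 for "... is a liar".
    ok = 0
    for speaker, atoms in statements:
        all_true = all(chromosome[p] == t for p, t in atoms)
        # Strong truth-teller: every atomic statement true; Weak liar: at least one false.
        ok += 1 if (chromosome[speaker] == 1) == all_true else 0
    return ok

def mutate(c, pm):
    return [1 - b if random.random() < pm else b for b in c]

def uniform_crossover(a, b):
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def solve(n, statements, ps=2000, pm=0.01, max_generations=500):
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(ps)]
    for _ in range(max_generations):
        best = max(pop, key=lambda c: fitness(c, statements))
        if fitness(best, statements) == len(statements):
            return best                                       # every statement fulfilled
        weights = [fitness(c, statements) + 1 for c in pop]   # roulette wheel selection
        new_pop = [best]                                      # elitism: keep the best
        while len(new_pop) < ps:
            p1, p2 = random.choices(pop, weights=weights, k=2)
            new_pop.append(mutate(uniform_crossover(p1, p2), pm))
        pop = new_pop
    return None
```

For the three-participant example above (Alice = 0, Bob = 1, Charlie = 2), the statement list would be [(0, [(1, 0), (2, 1)]), (2, [(0, 0)])].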
The mutation operator is $m: C \to C$. It negates the bits of a chromosome, each with a small probability $p$:

$m((s_1, s_2, \dots, s_n)) = (s_1', s_2', \dots, s_n')$, where $s_i' = s_i$ with probability $1 - p$ and $s_i' = |1 - s_i|$ with probability $p$.

The uniform crossover operator is $u: C \times C \to C$. It assigns a new chromosome to two chromosomes, where the bit at each position gets the value of the bit at the corresponding position of either of the two chromosomes with a probability of 50%:

$u((s_1, s_2, \dots, s_n), (t_1, t_2, \dots, t_n)) = (s_1', s_2', \dots, s_n')$, where $s_i' = s_i$ with probability 50% and $s_i' = t_i$ with probability 50%.

4. FINE-TUNING
The data were gathered by running the algorithm 10, 50, or 100 times and taking the average. A run is measured by the number of individuals it creates, which we will refer to as the running time. The following parameters are tuned:
N - number of participants
PM - probability of mutation
PS - population size
First we generate the puzzle from the only parameter N, the number of participants. The program generates 5% of all possible atomic statements (for example, for 20 participants, about 19 statements) and makes the complex statements from them according to the definition.

4.1 Test 1: How does running time change when decreasing PM?
PS = 2000, N = 20.

Figure 1. Running time as a function of the probability of mutation. (Note: at zero probability there is a bigger chance of coming to a dead end.)

Setting the value of PM to 0,05 might seem reasonable, as 0,05 = 1/N, so the algorithm would make one mutation in each individual on average. However, the graph shows the tendency: the smaller the probability of mutation, the sooner we get the optimal result. This is because even just one mutation (changing the status of a participant) has dramatic effects on the fitness of the individuals. We should not go all the way to zero though: at zero, the probability of coming to a dead end increases, as usual in genetic algorithms.

4.2 Test 2: How does running time change when decreasing PS?
PM = 0.01, N = 25.

$x_e \geq 0$, (7)
$x_e$ integer, (8)

where $c_e$ is the cost of edge $e$. The main drawback of the connection-based model comes from the fact that it represents connections between all compatible trips. The number of compatible trips is high even with a small number of timetabled trips, due to the possible deadhead trips and the number of depots. This makes the size of the problem so large that it cannot be used effectively on real-life data, which consists of several thousand trips.

2.3 Time-space network
The time-space network has been introduced to vehicle scheduling by Kliewer et al. [7], and is able to efficiently solve even large-sized real-world MDVSP instances. As was pointed out for the connection-based model, the number of edges connecting compatible trips is high, while only a few of these are used in a feasible vehicle schedule. Dropping any of these connections would lose the optimality of the solution. However, the number of edges has to be reduced to solve the problem efficiently. The time-space network solves the problem mentioned above. The model uses two dimensions, time and space. Space stands for the set of geographical locations, while time means timelines assigned to every location. The arrival and departure times are denoted on these timelines, with each point in time being a node of the model. As can be seen, the time-space network assigns the nodes to geographical locations, as opposed to the connection-based network.
Apart from this difference, the set N of nodes of the network can be defined similarly to the connection-based approach. $A_d$ can also be given in a similar way for each depot $d \in D$, and $P$ can be defined with the help of the timelines associated with the depots. The definition of deadhead trips is the main difference between the two models. The timelines used by the time-space network can be used to aggregate deadhead trips, with the help of newly introduced waiting edges connecting adjacent nodes on the timeline. This method significantly reduces the size of the problem. Waiting edges always connect two adjacent nodes on the appropriate timeline. Denoting the set of waiting edges for a depot $d \in D$ with $W_d$, the set of edges of the network is

$E := A_d \cup B_d \cup P_d \cup W_d \cup \{(e(d), s(d))\}$ for every $d \in D$.

The IP model of the network is similar to the one presented for the connection-based model, where $n^+$ is the set of incoming edges of node $n$, and $n^-$ is the set of outgoing edges of node $n$.

3. SOLUTION METHODS
Using the time-space network introduced in the previous section, we present some solution methods for the MDVSP. An important aspect studied in all of these methods is their applicability to real-life problems. For effective use in public transportation, the solution should be obtained fast, and its cost should be close to the optimal cost of the problem. First, the traditional approach of solving our problem with an MILP solver is presented, then heuristics from the literature are introduced.

3.1 MILP solver
The first method we applied for solving the IP problem of the time-space network was using an MILP solver (in our case: SYMPHONY¹). Due to the large size of the problem, only the first feasible solution was found. The cost of the resulting solution was nearly optimal (gap: 0,1-0,2% from the optimum), but the running time was slow in many cases (even 1-2 hours for some instances). Companies execute the method for a whole planning period, which usually is a time span measured in months. This planning period has several different day-types (11 in the case of Szeged), and the vehicle scheduling problem has to be solved for all of them. Solving the IP for all these problems would take a large amount of time. As can be seen above, a fast algorithm has to be applied to the problem if we want to use it as part of a decision support system. For this, we examined several different heuristics from the literature.

3.2 Rounding heuristic
As the running time of the process mainly consists of finding a feasible integer solution of the relaxed LP problem, accelerating the IP solution process can decrease the running time of the algorithm. The difference between the value of the LP relaxation of the MDVSP and the optimal integer solution is usually small, as a large percentage of the variables have integer values in the solution of the LP. The heuristic approach proposed by Suhl et al. [9] aims to utilize this feature. The IP solver is stopped when certain criteria are met (which can be, e.g., the number of visited nodes, or the difference between the cost of the actual problem and the LP relaxation), and a rounding algorithm is executed on this partial solution. The algorithm uses two rounding intervals $[0, r_l]$ and $[r_u, 1]$, where $0 < r_l < r_u < 1$. Consider a variable $x_j$, and let $x_j - \lfloor x_j \rfloor = f_j$, where $0 < f_j < 1$.
The value of $x_j$ is rounded according to the following rules:
- if $f_j \in [0, r_l]$, then $x_j := \lfloor x_j \rfloor$,
- if $f_j \in [r_u, 1]$, then $x_j := \lceil x_j \rceil$.
¹ http://www.coin-or.org/SYMPHONY/

Several rounding iterations can be carried out sequentially, provided that the resulting LP still gives a feasible solution. If no variables were rounded, then the rounding intervals can be extended. After the rounding steps, the branch-and-bound IP solver is executed again on the LP. The process is repeated until all variables are fixed, or the problem has no feasible solution. The gap from the optimum still remains around 0,2%, and there is a decrease in running time. However, some instances can still take around 1 hour to solve, which is inadequate in certain cases.

3.3 Variable fixing heuristic
This heuristic is used to further decrease the size of the MDVSP. The basic idea of variable fixing is to solve simplified problems based on the original model, and to look for series of trips that are common to all solutions. If such trips exist, they are considered stable chains, and it is supposed that they also appear the same way in the global optimum. Stable chains are represented in the model as one huge trip, thus significantly decreasing the size of the problem. After all stable chains have been found, the resulting smaller MDVSP can be solved with an MILP solver. Kliewer et al. [8] proposed to use SDVSP sub-problems for all depots of our MDVSP as the simplified problems. For a depot d, we solve the following sub-problem:
- The capacity of the depot is equal to the sum of all depot capacities of the MDVSP.
- Only those trips are considered which can be executed from depot d.
After each SDVSP sub-problem is solved, the solutions are used to create stable chains. Using these stable chains, a new MDVSP problem is created, which has the following properties:
- The number and capacity of the depots are the same as in the original problem.
- The set of trips of the new problem consists of the trips not included in any of the stable chains and a newly created trip for each stable chain. The cost of each new trip is equal to the sum of the costs of the trips in the stable chain it represents; it has the departure time and starting location of the first trip of the chain, and the arrival time and ending location of the last trip of the chain. These trips can be executed from any depot.
Solving this MDVSP, the final solution can be acquired by substituting back the original trips in place of the stable chains.

4. A GREEDY VARIABLE FIXING HEURISTIC
The heuristic we propose is also based on the method of creating stable chains. Initially, an SDVSP problem is solved, which is created by transforming the MDVSP in the following way:
- The capacity of the depot is equal to the sum of the capacities of all the depots in the MDVSP.
- The SDVSP contains all the trips from the MDVSP, and all trips can be executed from the single depot.
This new SDVSP problem is solved, and stable chains are formed from this solution using a greedy method. The method assigns a value to each depot of the MDVSP using the following function:

(1/e) · d_c(d) + d_b(d),

where d_c(d) is the daily cost and d_b(d) is the distance-based cost of depot d, and the parameter e > 0 is an arbitrary real number. Our test cases use e = 210. The average distance covered by a vehicle in a day is around 210 km, thus the function above gives an estimate of the total cost for the vehicle to cover 1 km. Depots are ordered into a list L according to this cost.
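As a quick illustration of this ordering step (the cost figures below are made up, not the paper's data):

```python
def depot_value(daily_cost, distance_cost, e=210.0):
    # (1/e) * d_c(d) + d_b(d): a per-kilometre estimate of a vehicle's total cost,
    # assuming roughly e = 210 km covered per day
    return daily_cost / e + distance_cost

depots = {"D1": (31500.0, 180.0), "D2": (29400.0, 205.0), "D3": (33600.0, 160.0)}
order = sorted(depots, key=lambda d: depot_value(*depots[d]))
print(order)   # ['D3', 'D1', 'D2'] - depots are processed starting with the lowest value
```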
For every depot d of the list L (starting with the one with the lowest cost), the greedy algorithm examines all the vehicle schedules in the result of the SDVSP. If subsequent trips are found which can be executed from this depot, they are considered a stable chain. These trips are flagged, and cannot be part of other stable chains. After all depots have been examined, a new MDVSP problem is created. This procedure is similar to the one described in Section 3.3; the only difference arises when introducing the new trips. These trips can only be executed from depots that can serve every trip of the stable chain that the new trip represents.

Let D be the set of depots of the problem, and J the set of all trips. Let V be a set storing all flagged trips, and S a set containing chains of trips, which are the stable chains. Let L be a list of trips that is used to build up the chains. Using the notation above, the algorithmic code of the heuristic can be seen in Figure 1.

Greedy_variable_fixing
  D ← depots with assigned values
  S ← ∅, V ← ∅, L ← ∅
  while D is not empty:
    choose d ∈ D with the lowest value
    solve an SDVSP with all trips j ∈ J
    j := first trip of the solution with j ∉ V
    while j can be executed from depot d and j ∉ V:
      L ← L ∪ {j}
      V ← V ∪ {j}
      j := next_trip_of(j) in the solution
    if |L| > 1:
      S ← S ∪ {L}
    else:
      delete j from V
    L ← ∅
    delete d from D
  return S

Figure 1: The greedy variable fixing heuristic.

As the running time of the heuristic was fast, further changes have been experimented with to improve the cost of the solution with a minimal increase in running time. The more trips there are in a stable chain, the higher the chance is that the greedy method chooses a trip that is inadequate regarding optimality, making the cost of that stable chain higher. To avoid this, a constraint can be introduced to limit the length of the chains. The SDVSP problem is created the same way as before; the only difference is in making the stable chains: if the number of trips in a chain reaches the limit, the chain is not extended further. Experience shows that limiting the length of the stable chains results in a solution with better cost, but with an increase in running time. Though there have not been enough tests concerning this observation yet, it looks like a promising direction for further research.

5. TEST RESULTS
The above methods were tested on real-life data instances from the city of Szeged, Hungary. The company uses 14 different day-types (called combinations). For a normal planning period (2 months in the case of the company), all 14 combinations have to be calculated, and the schedule for a given day depends on the combination type of the day. The results of the algorithms are presented on 4 day-types, and the properties of each can be seen in Table 1 below. The combinations with a higher number of trips (szeged1 and szeged4) are working days of the week, while szeged2 and szeged3 are instances taken from a Sunday and a Saturday, respectively. The running time in the table is the time in seconds needed to find the first feasible solution using the SYMPHONY solver on the time-space network model. As mentioned before, the maximum gap of all first feasible solutions from the optimum could be calculated, and gave a value of 0,2%. As can be seen, the running times of the instances are all above 1,5-2 hours. This is not acceptable in most cases, as the vehicle schedules are usually given for a longer planning period (several weeks or months), and calculating the schedules for all day-types would take more than a day this way.
The results of the heuristics will be examined from three aspects: the decrease in running time compared to the MILP solution, the maximum gap to the optimal solution, and the structure of the schedules, which is explained below. Decision support systems usually do not consider vehicle scheduling as a stand-alone problem, but use it as an initial input for driver scheduling and rostering. These problems have various working constraints of the drivers that have to be fulfilled; the most important ones are the maximum consecutive driving time without any rest, and the total length of the schedule given to the driver. Our vehicle schedules will also be analysed from this aspect, and if a schedule violates either of the two rules mentioned above (by being "too long" for one driver, or "too dense" to give proper resting time), it is considered a "bad" schedule regarding the driver modules.

As can be seen from Table 1, the first feasible solutions give schedules with a good structure from the drivers' point of view. The results given by the rounding heuristic (see Table 2) stay close to the original gap from the optimum, and also improve the running time significantly. There are even cases (see szeged3) where we get the exact same solution as the first feasible one, but with an improved running time. The structure of the solution stays very similar to the ones in Table 1, so the number of bad schedules is also low (in the case of these 4 instances, it is exactly the same). However, the running time is still around 1 hour in some cases, so this heuristic is still not suitable for making quick decisions. This heuristic is recommended when running time is not an important aspect, as it is guaranteed to give well-structured schedules with low cost.

Table 2. Results of the rounding heuristic
          Time ratio (%)   Max. gap to opt. (%)   "Bad" schedules
szeged1   38,0301          0,2584                 1
szeged2   57,8000          0,2000                 5
szeged3   66,0909          0,2000                 6
szeged4   40,2304          0,2061                 2

Results of the variable fixing heuristic show an even better decrease in running time: each instance can be solved in around 5-15 minutes, at the cost of increasing the gap from the optimum. However, the structure of the resulting schedules is "bad" considering the driver rules. This comes from the property that the algorithm produces really dense and really long stable chains (as there are many trips in our test cases that can be executed from all depots), which results in the schedules themselves being long and dense. Results can be seen in Table 3.

Table 3. Results of the variable fixing heuristic
          Time ratio (%)   Max. gap to opt. (%)   "Bad" schedules
szeged1   11,6069          0,3604                 12
szeged2   16,4000          0,3110                 11
szeged3   10,4688          0,6474                 20
szeged4   9,5098           0,3557                 12

The results of the greedy heuristic can be seen in Table 4. The heuristic needs only a couple of minutes (under 4-5 in all test cases) to finish. However, the gap from the optimal solution has risen to around 1-1,5%, which is still acceptable considering a real-life application. As opposed to the variable fixing heuristic, the greedy method fixes significantly more trips (~66% in comparison with ~33%) into stable chains, which significantly reduces the size of the problem. However, fixing trips into a stable chain means that their relation to other trips is already determined. Some trips will be placed in different schedules than they would be in the optimal scheduling, increasing the cost this way.
Table 1. Properties of data instances
          Number of trips   Running time (s)   "Bad" schedules
szeged1   2765              9503               1
szeged2   1826              1500               5
szeged3   2033              2837               6
szeged4   2758              7119               2

Limiting the greedy choice with alternative constraints (e.g. limiting the size of the chains, or the types of chosen trips) can lead to a solution with a better cost, but this would mean a larger problem size, which results in an increase in running time. The structure of the schedules is also acceptable: the greedy heuristic stays close to the low number of bad schedules provided by the first two solution methods.

6. CONCLUSIONS AND FUTURE WORK
We presented the vehicle scheduling problem and examined models and heuristics from the literature which may prove to be useful on real-life instances, and especially in the decision planning process of a transportation company. A main criterion for these heuristics was a good running time. With the help of the examined methods, a new greedy heuristic was introduced for the problem, which gives a significant decrease in running time, while its gap from the optimum stays at an acceptable level. It is also important to note that the structure of these schedules is also good, considering their further application as an input to driver scheduling and rostering.

Test results show that our heuristic can be improved further. The present greedy choice has numerous properties that can be limited, and this way the cost can be improved while keeping the decrease in running time. This method is still being tested and analysed. Alternative options can also be introduced for the greedy choice instead of the current cost-based approach. Such methods (e.g. based on the number of depots that are able to serve the trips of a chain) are currently being developed.

7. ACKNOWLEDGEMENT
This paper was partially supported by the Szeged City Bus Company (Tisza Volan, Urban Transport Division). I would also like to thank Zsuzsanna Papp and Jozsef Bekesi for their help with the implementation of the time-space network and the rounding heuristic.

8. REFERENCES
[1] Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows. In: Nemhauser, Rinnooy Kan, and Todd (eds.), Chapter IV, pp. 211-369, 1989.
[2] Balogh, J., Bekesi, J., Galambos, G., Kresz, M.: An Assignment Model for Real-World Vehicle Scheduling Problem with Refueling. Proceedings of the 9th Workshop on Models and Algorithms for Planning and Scheduling Problems, Abbey Rolduc, The Netherlands, June 29 - July 03, 2009, pp. 229.
[3] Bekesi, J., Brodnik, A., Kresz, M., Pas, D.: An Integrated Framework for Bus Logistics Management: Case Studies. Logistik Management 5., pp. 389-411, 2009.
[4] Bertossi, A.A., Carraresi, P., Gallo, G.: On Some Matching Problems Arising in Vehicle Scheduling Models. Networks 17, pp. 271-281, 1987.
[5] Bodin, L., Golden, B., Assad, A., Ball, M.: Routing and Scheduling of Vehicles and Crews: The State of the Art. Computers and Operations Research 10., pp. 63-212, 1983.
[6] Bunte, S., Kliewer, N.: An Overview on Vehicle Scheduling Models.
[7] Kliewer, N., Mellouli, T., Suhl, L.: A time-space network based exact optimization model for multi-depot bus scheduling. European Journal of Operational Research 175., pp. 1616-1627, 2006.
[8] Kliewer, N., Mellouli, T., Suhl, L.: Solving large multiple-depot multiple-vehicle-type bus scheduling problems in practice. OR Spectrum 27., pp. 507-523, 2005.
[9] Suhl, U.H., Friedrich, S., Waue, W.: Progress in solving large scale multi-depot multi-vehicle-type bus scheduling problems with integer programming.
Wirtschaftsinformatik Proceedings, Paper 81, 2007.

Table 4. Results of the greedy heuristic
          Time ratio (%)   Max. gap to opt. (%)   "Bad" schedules
szeged1   4,4828           1,3168                 3
szeged2   9,4000           1,0379                 7
szeged3   10,0458          0,3459                 7
szeged4   4,8883           1,3176                 2

BAYESIAN GAMES ON A MAXMIN NETWORK ROUTER
Grigorios G. Anagnostopoulos
Department of Electrical and Computer Engineering, Democritus University of Thrace
67100 Xanthi, Greece
Telephone: 2541020486 (0030)
griganag@ee.duth.gr

ABSTRACT
In this paper, we consider network games with incomplete information. In particular, we apply a game-theoretic network model to analyze Bayesian games on a MaxMin router. The results show that the MaxMin mechanism induces desirable Nash equilibria.

Categories and Subject Descriptors
F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity—Nonnumerical Algorithms and Problems; G.2.2 [Mathematics of Computing]: Discrete Mathematics—graph theory, network problems

General Terms
Algorithms, Theory

Keywords
Bayesian games, MaxMin fairness, Network games, Nash equilibrium, Algorithmic Game Theory

Supervisor: Pavlos S. Efraimidis

1. INTRODUCTION
The interaction of independent Internet flows that compete for common network resources, like the bandwidth at routers, has recently been addressed with tools from algorithmic game theory. In this context, the flows are considered independent players who seek to optimize personal utility functions, such as the bandwidth. The mechanism of the game is determined by the network infrastructure and the policies used. The solution concept commonly used is the Nash equilibrium (NE), i.e., a state of the game in which no player has anything to gain by unilaterally changing her strategy. An assumption underlying Nash equilibria is that each player holds the correct belief about the other players' actions. To do so, a player must know the game she is playing. However, in many situations the participants are not perfectly informed about their opponents' characteristics: firms may not know each other's cost functions, player-flows competing for the bandwidth of a network may not know the exact number of active flows, etc. The model of a Bayesian game [4] generalizes the notion of a strategic game to cover situations in which each player is imperfectly informed about some aspect of her environment relevant to her choice of an action. The players may be uncertain about the exact state and instead assign a probability distribution over the possible states.

The model that we will use, the Window-game (presented in [2]), has a router and N flows, and is played synchronously, in one or more rounds. Every flow is a player that selects in each round the size of its congestion window. The congestion window of a TCP flow is a parameter that determines the maximum number of packets that the flow is allowed to have outstanding in the network at each moment [1]. Real TCP flows adjust their congestion window to control their transmission rate. The router of the Window-game (the mechanism of the game) receives the actions of all the flows and decides how the capacity is allocated. The utility or payoff $P_i$ of each flow $i$ is equal to the capacity that it obtains (transmitted packets) from the router in each round, minus a cost $g$ for each dropped packet, that is:

$P_i = \mathrm{transmitted}_i - g \cdot \mathrm{dropped}_i$   (1)

We assume that the cost $g$ for each dropped packet is constant for a given game and is the same for all players.
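Before turning to the specific games, here is a small sketch of one round of such a Window-game (illustrative Python, not the paper's model code): the max-min fair allocation is implemented with the usual water-filling procedure, and the numbers in the example are made up.

```python
def maxmin_allocation(requests, capacity):
    """Water-filling max-min fair split of `capacity` among the requested windows."""
    alloc, pending, remaining = {}, dict(requests), float(capacity)
    while pending:
        share = remaining / len(pending)
        small = {f: w for f, w in pending.items() if w <= share}
        if not small:                        # everyone wants more than the fair share
            for f in pending:
                alloc[f] = share
            break
        for f, w in small.items():           # satisfy the small requests fully...
            alloc[f] = w
            remaining -= w
            del pending[f]                   # ...and redistribute what is left
    return alloc

def payoffs(requests, capacity, g):
    alloc = maxmin_allocation(requests, capacity)
    return {f: alloc[f] - g * (w - alloc[f]) for f, w in requests.items()}   # Eq. (1)

# An aggressive flow against a modest one on a router of capacity 12:
print(payoffs({"A": 12, "X": 6}, capacity=12, g=1.0))   # {'A': 0.0, 'X': 6.0}
```

The example already hints at the pattern analysed below: with a positive drop cost, requesting more than the fair share only hurts the aggressive flow.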
The general form of the games we study is a Window-game with a router of capacity C and N players. The game is played in one round, and each player i chooses its window size $w_i$. By definition, player X can be active or not. When we refer to player X without any other specification, we will implicitly refer to the active type of X. The same holds for the rest of this work. For $g > 0$, player X knows that A will play at least the FS. Thus, a capacity equal to FS remains available for X. There is no incentive for X to leave the $w = 6$ strategy, since a greater w would only reduce her payoff due to the cost of lost packets. Player A's payoff is equal to the sum of the products of her payoff at each of the two possible states with the probability of each state. It can be shown that these are the NE $(w_A, w_X)$ of the game:

Table 1. The NE of Game 1
g    g = 0               0 < g < 1   g = 1               g > 1
NE   (12, x) ∀x ≥ 6      (12, 6)     (x, 6) ∀x ≥ 6       (6, 6)

Game 2. Same as Game 1, but with an additional player B, who, just like A, is always active.
Solution sketch. With an analysis similar to Game 1, it can be shown that these are the NE $(w_A, w_B, w_X)$ of the game:

Table 2. The NE of Game 2
g    g = 0                           0 < g < 1   g = 1                       g > 1
NE   (x, y, z) ∀x, y ≥ 6, ∀z ≥ 4     (6, 6, 4)   (x, y, 4) ∀x, y ∈ [4, 6]    (4, 4, 4)

Comments. It is clear that with MaxMin, more information generally leads to NE with better outcomes for the players holding this information. In many similar scenarios, the choice of $g > 1$ generally gives fair NE.

3. A GENERAL SYMMETRIC (GS) GAME
Game GS1. A game with N players in which the number of active players is uniformly distributed in {1, 2, ..., N}. In this game, each player receives a signal that informs her whether she is active. Every active player assigns equal probability (p = 1/N) to each of the N possible game states, where 0, 1, 2, ..., or N-1 other players are active, respectively. For each of these N different cases of the Bayesian game, there is a different FS. For example, if N = 5 and C = 60:

Table 3. The possible FS values of Game GS1
n    1    2    3    4    5
FS   60   30   20   15   12

Due to the symmetric nature of the game, we will focus on its symmetric NE. Thus, we assume symmetric profiles for the game, i.e., all active players are using the same strategy w. In case of symmetric profiles, if the window size w is greater than the FS, then, for each flow, a number of packets equal to the FS will be transmitted, while for the remaining (w - FS) dropped packets there will be a reduction of the payoff proportional to the cost factor g. More precisely, the payoff of each flow will be:

$\mathrm{payoff} = FS - (w - FS) \cdot g = FS \cdot (g + 1) - w \cdot g$

In the cases where $w \leq FS$, the whole window is transmitted and the payoff is simply w. For the example game GS1 with N = 5 and C = 60, the expected payoff P(w) of an active player for a symmetric strategy w can be computed interval by interval:

1. For $w \leq 12$: $w \leq FS$ in every state, so $P(w) = w$.

2. For $12 < w \leq 15$: in the state where all 5 players are active, $w > FS = 12$ and the payoff is $FS \cdot (g+1) - w \cdot g = 12(g+1) - wg$. In the remaining cases with 4, 3, 2 and 1 active player(s), the payoff is w, because the FS of each of those cases is greater than any possible value of w in (12, 15]. Thus:

$P(w) = \frac{1}{5}w + \frac{1}{5}w + \frac{1}{5}w + \frac{1}{5}w + \frac{1}{5}[12(g+1) - wg] = w\frac{4-g}{5} + \frac{12}{5}(g+1)$

Working in a similar way, we obtain that:

3. For $15 < w \leq 20$: $P(w) = \frac{3}{5}w + \frac{12+15}{5}(g+1) - \frac{2}{5}wg = w\frac{3-2g}{5} + \frac{27}{5}(g+1)$

4. For $20 < w \leq 30$: $P(w) = \frac{2}{5}w + \frac{12+15+20}{5}(g+1) - \frac{3}{5}wg = w\frac{2-3g}{5} + \frac{47}{5}(g+1)$

5. For $30 < w \leq 60$: $P(w) = \frac{1}{5}w + \frac{12+15+20+30}{5}(g+1) - \frac{4}{5}wg = w\frac{1-4g}{5} + \frac{77}{5}(g+1)$

In the above analysis we can observe how the PF changes with respect to the intervals that are defined by the possible FS values.
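Before generalizing, a quick numerical check of the piecewise expressions above (N = 5, C = 60); this is only a verification aid, not part of the paper.

```python
def expected_payoff(w, g, C=60, N=5):
    total = 0.0
    for n in range(1, N + 1):              # n active players, each with probability 1/N
        fs = C / n
        total += w if w <= fs else fs * (g + 1) - w * g
    return total / N

g = 1.0
checks = [(14, 14 * (4 - g) / 5 + 12 / 5 * (g + 1)),      # interval (12, 15]
          (18, 18 * (3 - 2 * g) / 5 + 27 / 5 * (g + 1)),  # interval (15, 20]
          (25, 25 * (2 - 3 * g) / 5 + 47 / 5 * (g + 1)),  # interval (20, 30]
          (40, 40 * (1 - 4 * g) / 5 + 77 / 5 * (g + 1))]  # interval (30, 60]
for w, formula in checks:
    assert abs(expected_payoff(w, g) - formula) < 1e-9
```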
By generalizing the above analysis to any number of players N, we obtain that:

$P(w) = \frac{X}{N}w + \frac{Z}{N}(g+1) - \frac{Y}{N}wg, \quad \forall w > \frac{C}{N}$; $\quad P(w) = w, \quad \forall w \leq \frac{C}{N}$   (2)

where X, Y and Z are variables whose values depend on the parameters of the game instance.

Theorem 1: In a MaxMin game with capacity C and cost g, and a random number n of active players, where n is uniformly distributed in {1, 2, ..., N}, the payoff function (PF) for symmetric profiles is:

$P(w) = \frac{1}{N}\left[ w\,(a - (N-a)g) + (g+1)\sum_{i=a+1}^{N} \frac{C}{i} \right], \quad \frac{C}{a+1} < w \leq \frac{C}{a}, \quad a = 1, 2, \dots, N-1$

$P(w) = w, \quad \forall w \leq \frac{C}{N}$

Proof: We will derive closed form expressions for the terms X, Y and Z of Equation 2. The term X is the number of the "w/N" terms, which is simply the number of the cases (game states) where $w \leq FS$; the term Y is the number of cases where $w > FS$. So we have $X + Y = N$. For N players, the possible FS values define a set of N-1 consecutive intervals. For $a = 1, 2, \dots, N-1$, let $s_a$ be the interval between the FS's for $n = a+1$ and $n = a$ active players. That is, $s_a$ is the following interval:

$\frac{C}{a+1} < w \leq \frac{C}{a}$

Note that a gives the number of possible FS values (the number of game states) where $w \leq FS$ on the corresponding $s_a$ interval. That is because for $w \in s_a$,

$w \leq \frac{C}{a} < \frac{C}{a-1} < \frac{C}{a-2} < \dots < \frac{C}{1}$

As noted earlier, the number of cases where $w > FS$ is then $Y = N - a$, and the slope of P(w) on $s_a$ is zero exactly when

$a - (N-a)g = 0 \;\Rightarrow\; g = \frac{a}{N-a}$

By solving the above equation for each of the $s_a$ intervals, we can find the value of g for which the slope of P(w) is zero. Low values of the cost factor g provide incentives for the flow to play aggressively. Consequently, for very low values of g, starting from zero, the slope of P(w) is positive for all w.

Let k be the number of players that corresponds to the fair share FS that is equal to the best symmetric strategy $w_k$ ($FS_k = C/k = w_k$), as given by Theorem 2. Let $w_A = w_k + x$. We partition the field of possible values of $w_A$ into intervals.

2.1. Let $w_A \in (w_k, 2w_k]$. Then $x \in (0, w_k]$. Player A will receive a payoff equal to:
• $w_A$, for each state with $j = 1, 2, \dots, k-1$ active players;
• $FS_j - (w_A - FS_j) \cdot g$, for each state with $j = k, k+1, \dots, N$ active players.
Among the states with $1, 2, \dots, k-1$ active players, the greatest congestion occurs in the state where k-1 players are active. In this worst case, all the other k-2 players choose $FS_k$, so the remaining capacity for player A is

$2 \cdot w_k \geq w_A = w_k + x$.

Thus, in these states player A has a successful transmission of her whole window. On the other hand, when k players are active and all the other k-1 players chose $FS_k$, the remaining capacity for player A is $FS_k$. Similarly, for all other states with $j = k+1, \dots, N$ active players, player A can successfully send the $FS_j$ of each state and will suffer the cost of her $w_A - FS_j$ dropped packets. The terms that we add to find the expected payoff of player A are equal to the terms that we need to add to calculate the expected payoff of symmetric profiles in the interval

$\frac{C}{k} < w \leq \frac{C}{k-1}$.

If all players chose a symmetric profile in the interval $(FS_k, FS_{k-1}]$, then they would successfully transmit their window size in the states where $j = 1, 2, \dots, k-1$ players are active, while in all other states they would successfully transmit the FS of each state. All symmetric profiles in this interval give a lower payoff than the symmetric profile $w_k$, as shown in Theorem 2, and the linear function would be decreasing on this interval.
Since the payoff of player A with strategy $w_A \in (w_k, 2w_k]$, when all other players stay at $w_k$, is given by the same decreasing linear function that gives the payoff of the symmetric profiles of the interval $(FS_k, FS_{k-1}]$, we conclude that when one player A unilaterally deviates from the symmetric profile $w_k$ to a strategy $w_A$, there is no $w_A \in (w_k, 2w_k]$ such that $P(w_A) > P(w_k)$.

Comparing, at each possible state, player A's payoff when she chooses $w_A$ to the payoff she receives when she stays at $w_k$ (let $p_{ki}$ be this payoff, where i is the number of active players in the state), given that all other players play $w_k$, we notice that with $w_A$ player A receives:
• $p_{ki} + x$, for all states with $i = 1, 2, \dots, k-1$ active players;
• $p_{ki} - x \cdot g$, for all states with $i = k, k+1, \dots, N$ active players.
Since $P(w_A) \leq P(w_k)$,

$P(w_A) - P(w_k) = \frac{x(k-1)}{N} - \frac{x \cdot g(N-k+1)}{N} \leq 0.$   (3)

2.2. Let $w_A \in (h \cdot w_k, (h+1) \cdot w_k]$ for some $h \in \{2, 3, \dots, k-1\}$, with $w_A = w_k + x$. In this case player A receives:
• $p_{ki} + [C - w_k \cdot (i-1)] < p_{ki} + x$, but with an extra penalty of $g \cdot [w_A - [C - w_k \cdot (i-1)]] > 0$, for all states with $i = k-h+1, \dots, k-1$ active players. Let $Q = g \cdot [w_A - [C - w_k \cdot (i-1)]]$ be this penalty, and let Y be such that $[C - w_k \cdot (i-1)] + Y = x$. Then we will assume that she receives $p_{ki} + x$ with an extra penalty of $Z = Q + Y$ (we add and subtract Y).
• $p_{ki} - x \cdot g$, for all states with $i = k, k+1, \dots, N$ active players. In these states all players receive the $FS_i$ of each state minus the cost of their dropped packets; strategy $w_A$ loses x more packets than strategy $w_k$.
So, for $w_A \in (h \cdot w_k, (h+1) \cdot w_k]$, there is:
• a linear increase for each extra packet, similar to the increase when $w_A \in (w_k, 2w_k]$, for all states with $i = 1, 2, \dots, k-h$ active players;
• a linear increase of the penalty for the dropped packets for each extra packet, for all states with $i = k, k+1, \dots, N$ active players;
• an extra penalty, previously named Z, for all states with $i = k-h+1, \dots, k-1$ active players.
So, comparing the payoff of player A for $w_A$ and $w_k$, given that all other players play $w_k$:

$P(w_A) - P(w_k) = \frac{x(k-1)}{N} - \frac{x \cdot g(N-k+1)}{N} - Z,$   (4)

where $Z > 0$. The difference of the first two terms, $\frac{x(k-1)}{N} - \frac{x \cdot g(N-k+1)}{N}$, does not exceed zero, as shown in Equation 3. Since we subtract a non-negative value Z from a non-positive difference, we conclude that in Equation 4, $P(w_A) - P(w_k) < 0$. So there is no $w_A \in (h \cdot w_k, (h+1) \cdot w_k]$, for any $h \in \{2, 3, \dots, k-1\}$, such that $P(w_A) > P(w_k)$. Since there is no possible deviation that can offer an improved payoff to any player who unilaterally changes her strategy, this strategy set (or strategy profile) is a NE.

4. CONCLUSION
We examined Bayesian Window-games on a MaxMin router and showed that the MaxMin queue policy leads to fair NE. The strategy profiles of the possible NE depend on the value of g. For plausible values of g (for example g = 1), the NE strategy profiles utilize a sufficiently large part of the router capacity, while being at the same time absolutely fair. Finally, we believe that the NE described in Theorem 3 are the only symmetric NE of the game. In particular, we believe that we can use the machinery of Theorems 1, 2 and 3 to show this result, and we intend to derive a proof in our future work.

5. REFERENCES
[1] Akella, A., Seshan, S., Karp, R., Shenker, S., and Papadimitriou, C. 2002. Selfish behavior and stability of the internet: a game-theoretic analysis of TCP. In Proceedings of the 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '02, 117-130.
[2] Efraimidis, P. S., Tsavlidis, L., and Mertzios, G. B. Window-games between TCP flows. Theor. Comput. Sci.
411, 31-33, 2798-2817, 2010.
[3] Papadimitriou, C. 2001. Algorithms, games, and the internet. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing (Hersonissos, Greece). STOC '01. ACM, New York, NY, 749-753.
[4] Martin J. Osborne, An Introduction to Game Theory, Oxford University Press, 2004, ISBN-10: 0195128958, ISBN-13: 978-0195128956.

An Extensible Probe Architecture for Network Protocol Performance Measurement
David Božič
Univerza na Primorskem FAMNIT
Glagoljaška 8, Koper, SI 6000
+386 5 615 75 70
david.bozic@student.upr.si

ABSTRACT
The article describes the architecture, implementation, and application of Windmill, a passive network protocol performance measurement tool that enables experimenters to measure a broad range of protocol performance metrics, both by reconstructing application-level network protocols and by exposing the underlying protocol layers' events. The range encompasses low-level IP, UDP and TCP events, such as packet corruption, length errors, duplications and drops, as well as application-level performance. By correlating these inter-protocol metrics, the normally hidden interactions between the layers are exposed for examination. Windmill was designed as a passive component of a larger Internet measurement infrastructure. Unlike most tools that focus on capturing data for post-analysis, Windmill was designed to support continuous passive measurements at key network vantage points. The architecture allows application-level protocol data to be distilled at the measurement point for either on-line analysis or further post-analysis. The extensible architecture enables experiment managers to vary the number and scope of Windmill's experiments. Windmill is split into three functional components:
• a dynamically compiled Windmill Protocol Filter (WPF),
• a set of abstract protocol modules,
• an extensible experiment engine.
The fast WPF and the support for both experiment extensibility and high-performance protocol reconstruction are the key contributions of this architecture. Through the combination of dynamic compilation and a fast matching algorithm, Windmill's WPF can match an incoming packet against five components in less than 350 ns. Additionally, the WPF addresses some fundamental limitations of past packet filtering technology by correctly handling overlapping filters. Windmill enables the dynamic placement, management, and removal of long-running experiments, while accommodating the significant performance demands of protocol reconstruction. For the rational growth of the Internet to continue, a deeper understanding of the interactions between its protocols is needed. Windmill can be used as an implementation of a passive application-level protocol performance measurement device, and to explore these interactions in real-world settings. The article also presents an application of Windmill as a set of experiments measuring a broad range of statistics of an Internet Collaboratory. The experiments show Windmill's ability to perform on-line data reduction by extracting application-level performance statistics, such as user actions, and demonstrate the use of the passive probe to drive a complementary active measurement infrastructure, without modifying end-host application or operating system code.

Keywords
Windmill, protocol filter, packet module, passive measurement, online analysis.
1. ARCHITECTURE
Windmill's functional components are:
- a dynamically generated protocol filter,
- a set of abstract protocol modules,
- an extensible experiment engine.
The organization of these components is shown in Figure 1.

Figure 1. Organization of Windmill's architecture.

The dynamically compiled filter matches the underlying network traffic against the WPF. This filter is constructed from the set of outstanding packet subscriptions of the engine's concurrent experiments. These individual subscriptions act as a filter for each experiment; they describe which packets the experiment is interested in receiving. The abstract protocol modules provide both efficient implementations of target protocol layers and interfaces for accessing normally hidden protocol events. Modules are composed to enable the efficient execution of the protocol stack on incoming packets. The extensible experiment engine provides an infrastructure for the loading, modification, and execution of probe experiments. Additionally, the experiment engine provides interfaces for the storage and dissemination of experimental results.

Experiments are loaded into the experiment engine through an interface controlled remotely by the probe's administrator. Once installed in the engine, an experiment subscribes to packet streams using the abstract protocol modules. Subscriptions are passed to the protocol filter, which dynamically recompiles its set of subscriptions into a single block of native machine code. This code is installed in the kernel for fast matching and multiplexing of the underlying network traffic. Upon a packet match, the protocol filter sends the packet, along with the set of matching experiments, to the packet dispatch routine in the experiment engine. The packet is then given to each matching experiment. Each experiment then uses the abstract protocol modules to process the standard protocol information in the packet. Rather than starting with the lowest layer, usually IP, the packet is given to the highest-level protocol module, e.g. HTTP or TCP. These higher-level protocol modules then recursively call the lower layers of the protocol stack. After processing, an experiment can extract results from any of the protocol layers. The results include the protocol frame or byte-stream service exported by the lower layers, or protocol events and error conditions triggered by the packet.

The current implementation of Windmill is based on off-the-shelf software and hardware. A custom version of the FreeBSD 3.3 kernel serves as the base for the engine's software. Intel-based PCs serve as Windmill's cost-effective hardware platform. Currently, Windmill is being used with broadcast or ring-based datalink layers, including Ethernet and FDDI.

2. WINDMILL PROTOCOL FILTER
The Windmill Protocol Filter passively examines all the underlying network traffic and performs one-to-many packet multiplexing to the probe experiments. It does this by constructing an intermediate representation of the outstanding subscriptions in the form of a directed acyclic graph (DAG), which is dynamically compiled to a native machine language module and is finally installed in the probe machine's kernel. As an example, a request for TCP traffic with source port A from experiment 1 and a request for TCP traffic on source port B from experiment 2 would result in the DAG shown in Figure 2.

Figure 2. Simple DAG representation of two packet subscriptions.
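To make the module composition described above concrete, here is an illustrative sketch (hypothetical classes, not Windmill's actual code): the higher-level module is handed the packet first and recursively calls the layer below it, and each layer can attach the normally hidden events it detects.

```python
class IPModule:
    def process(self, packet):
        header, payload = packet["ip"], packet["payload"]
        events = [] if header["length"] == len(payload) else ["ip.length_error"]
        return {"events": events, "payload": payload}

class TCPModule:
    def __init__(self, lower):
        self.lower = lower                     # composed with the lower-layer module

    def process(self, packet):
        below = self.lower.process(packet)     # recursive call into the lower layer
        events, segment = below["events"], below["payload"]
        # ... reconstruct the byte stream, flag duplications/drops as events here ...
        return {"events": events, "stream": segment}

tcp = TCPModule(IPModule())
result = tcp.process({"ip": {"length": 3}, "payload": b"abc"})
print(result["events"])                        # [] - no low-level events for this packet
```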
The Windmill Protocol Filter differs from past packet filters [1], where network packets are passively matched to a specification and demultiplexed to a single endpoint, in that it identifies a set of destinations for a packet. By determining a set of endpoints, the WPF avoids the subtle problem, inherent in one-to-one matching algorithms, of client starvation caused by overlapping filters. One-to-many matching is motivated by the fact that a probe machine may be executing numerous concurrent experiments that are interested in some of the same packet streams. As the streams of packets arrive, the filter of each experiment must be used to determine which packets are sent to which experiments. This can be done at reception time, where each packet is compared to the different experiments' filters. A similar process can be implemented using multiple BPF (Berkeley Packet Filter) devices, one for each experiment. Packets can also be matched to experiments by determining a packet's destinations before its reception. The Windmill Protocol Filter adopts the latter approach in that it computes all possible combinations of overlapping filters when they are registered, and generates a DAG to reflect these comparisons. Once the DAG is constructed, it is compiled to native machine language on the fly and installed in the kernel for matching against incoming packets.

A message header consists of a set of comparison fields. A filter is composed of a conjunction of predicates. Each predicate specifies a Boolean comparison for a particular field. An experiment registers a filter by supplying a set of values for one or more of these comparison fields. These fields can correspond to Internet protocol specific values, e.g. the IP source address or the TCP destination port, or they can be subsets of these values. This allows filtering based on fields such as the first 24 bits of the IP source address, commonly used to examine packets from a specific network.

Table 1. Three overlapping packet filters and a sample input packet
                  IP Src Addr   IP Dst Addr   Protocol   Src Port   Dst Port
Filter 1          AS = X        *             P = T      PS = A     *
Filter 2          *             AD = Y        P = T      *          PD = B
Filter 3          *             AD = Z        P = T      *          PD = C
Packet to match   AS = X        AD = Y        P = T      PS = A     PD = B

Table 1 shows how the Windmill Protocol Filter works. Consider the set of filters above, with each filter representing the packet subscription of one experiment. The table shows five comparison fields as the basis for three experiments. Each entry in the table specifies the value to be matched, or a "*" representing a wildcard. For example, the sample input packet above matches both filters 1 and 2. The intermediate representation of these filters as a DAG is shown in Figure 3(a). The vertices represent Boolean operations on the comparison fields; a match results in a transition to the right. Furthermore, each vertex is also labeled with the set of corresponding filters for which the Boolean operation associated with the vertex is true. Consider the vertex AS = X, which is labeled with the set {1, 2, 3}.

Figure 3(a). DAG representation of the three filters (vertex chain AS = X, AD = Y, P = T, PS = A, PD = B with filter sets {1,2,3}, {1,2}, {1,2,3}, {1,2,3}, {1,2}; the sample packet matches filters 1 and 2).
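A short sketch of the one-to-many matching that Table 1 illustrates (hypothetical helper names, not the WPF implementation): each filter is a conjunction of field predicates, a missing field acts as the wildcard "*", and matching returns the whole set of interested experiments rather than a single endpoint.

```python
FILTERS = {
    1: {"src_addr": "X", "protocol": "T", "src_port": "A"},
    2: {"dst_addr": "Y", "protocol": "T", "dst_port": "B"},
    3: {"dst_addr": "Z", "protocol": "T", "dst_port": "C"},
}

def matching_experiments(packet):
    # A filter matches when every one of its predicates agrees with the packet;
    # fields it does not mention are wildcards.
    return {exp for exp, flt in FILTERS.items()
            if all(packet.get(field) == value for field, value in flt.items())}

packet = {"src_addr": "X", "dst_addr": "Y", "protocol": "T",
          "src_port": "A", "dst_port": "B"}
print(matching_experiments(packet))   # {1, 2} - both overlapping filters are served
```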