Informatica 33 (2009) 417-430 417 Transformation of XML Data into XML Normal Form Tadeusz Pankowski Institute of Control and Information Engineering Poznan University of Technology Pl. M.S.-Curie 5, 60-965 Poznan, Poland E-mail: tadeusz.pankowski@put.poznan.pl Tomasz Pilka Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614 Poznan, Poland E-mail: tomasz.pilka@amu.edu.pl Keywords: XML, XML functional dependencies, XML normal forms, XML schema design Received: March 4, 2009 Normalization as a way of producing good database designs is a well understood topic for relational data. In this paper we discuss the problem of eliminating redundancies and preserving data dependencies in XML data when an XML schema is normalized. Normalization is one of the main tasks in relational database design, where 3NF orBCNF, is to be reached. However, neither of them is ideal: 3NF preserves dependencies but may not always eliminate redundancies, BCNF on the contrary - always eliminates redundancies but may not preserve constraints. In this paper we consider the possibility of achieving both redundancy-free and dependency preserving form of XML schema. We show how the XML normal form can be obtained for a class of XML schemas and a class of XML functional dependencies. We relate our approach to the decomposition algorithm proposed by Arenas and Libkin in [1]. Povzetek: Razvita je metoda pretvorbe XML podatkov v normalizirano obliko s poudarkom na odstranjevanju redundanc. 1 Introduction Normal forms, as desired forms for schemas defining structures and properties of data collections, were first proposed and investigated by Codd in early 70s. The most important of them are 3NF (Third Normal Form) [2] and BCND (Boyce-Codd Normal Form) [3]), where BCNF is a stronger definition of 3NF. Definitions of normal forms are based on the functional dependencies defined among the attributes constituting the relational schema, and specify requirements to be satisfied by the set of these functional dependencies. Then any relation that is an instance of the schema and satisfies the given set of functional dependencies, is free of harmful redundancies and anomalies. In the normalization process, the initial poor designed relational schema is decomposed into an equivalent set of well-designed schemas, i.e. into schemas in desired normal forms (usually in 3NF and often also in BCNF). Recently, as XML becomes popular as the standard data model for storing and interchanging data on the Web and more companies adopt XML as the primary data model for storing information, XML schema design has become an increasingly important issue. Thus, we observe attempts to extend normal forms and database design principles to XML databases. Research on normalization of XML data was reported in a number of papers. In [1], Arenas and Libkin extended the relational tuple-oriented definition of functional dependencies to so-called tree tuple-based functional dependencies and developed the first theory of XML functional dependencies (XFDs) and XML normal forms (XNFs). This approach was further studied by Kolahi [4, 5], and Kolahi and Libkin [6]. Integrity constraints for XML data (including keys) were studied extensively in Buneman et al. in [7], and Fan and Simeon in [8]. The equivalence between XFDs and relational FDs was investigated by Vincent et al. in [9,10]. Yu and Jagadish proposed in [11] a new XML normal form, called GTT-XNF, that is based on Generalized Tree Tuples (GTT). Central objectives of a good schema design is to avoid data redundancies and to preserve dependencies enforced by the application domain (these dependencies are formalized by means of functional dependencies). Existence of redundancy can lead not only to a higher data storage cost but also to increased costs for data transfer and data manipulation. It can also lead to update anomalies [12]. One strategy to avoid data redundancies is to design redundancy-free schema. One can start from an intuitively correct XML schema and a given set of functional dependencies reflecting some rules existing in application domain. Then the schema is normalized, i.e. restructured, in such a way that the newly obtained schema has no re- 418 Informatica 33 (2009) 417-430 T. Pankowski et al. dundancy, preserves all data (is a lossless decomposition) and preserves all dependencies. In general, obtaining all of these three objectives is not always possible, as has been shown for relational schemas [13]. However, in the case of XML schema, especially thanks to its hierarchical structure, this goal can be more often achieved [4]. 1.1 Related work An algorithm, called the Decomposition Algorithm (DA) normalizing XML schemas was proposed in [1]. The algorithm converts any DTD, given a set of XML functional dependencies (XFDs), into DTD in XML normal form (XNF). The decomposition algorithm consists of two basic operations: moving attributes, and creating new element types. These operations are performed when an XFD violating XNF is identified. Thus, the basic idea is similar to normalization of relational data, when the second normal form (2NF), the third normal form (3NF), or the Boyce-Codd normal form (BCND) are to be achieved. In the relational counterparts during the normalization process a relational schema is decomposed into a set of its projections. Thus, we can obtain a set of separate relational schemas as the result of the normalization process performed for an initial relational schema. In the case of XML documents, the result is still one XML document restructured accordingly. In [14], an information-theoretic approach to normal forms for relational and XML data has been developed. Some other papers, e.g. [5, 15, 16, 6] study XML design and normalization for native or relational storage of XML documents. In [17, 18] approaches for obtaining well-designed XML schemas from conceptual ER (Entity-Relationship) schemas have been discussed. 1.2 Contribution In this paper we describe a systematic approach to the normalization of XML data when so-called cyclic functional dependencies exist, i.e. dependencies, which in the relational case are functional dependencies of the form {A, B ^ C,C ^ A}, where A, B, C are attributes in a relational schema R(A, B, C). It is well known from relational database theory [13], that the schema R(A, B, C) is then in 3NF, but is not in BCNF. As a result, instances of R(A, B, C) have redundancies, but decomposition the schema into BCNF leads to two schemas Rl(C,A) and R2(C, B), which are free of redundancy but do not preserve dependency A,B ^ C. We focus on normalization procedure for XML schemas with cyclic XFDs. The contributions of the paper are the following: - We use a language based on tree patterns [19, 20] to express XML schemas and XML functional dependencies. This notation is used in the formal analysis of properties of XML normal form as well as the base for developing transformation algorithms. - We propose an approach to obtain XNF starting from ER schema. In the first step the ER schema is converted into an XML schema satisfying a necessary conditions (see Theorem 7.3) which is a prerequisite to successful applying of DA algorithm [1]. We show that the presence of cyclic dependencies results in a bad behavior of the DA algorithm. This paper is organized as follows. In Section 2 we introduce a running example and motivate the research. In Section 3 a relational form of the running example is considered and some problems with its normalization are discussed. Basic notations relevant to the discussed issue from the XML perspective, are introduced in Section 4. We define tree patterns and use them to formal definition of tree tuples and instances. In Section 5, tree patterns are used to define data dependencies: XML functional dependencies (XFDs) and keys (a subclass of XFDs). In Section 6, an XML normal form (XNF) is defined (according to [1]) and we show how this form can be obtained for our running example. We discuss different normalization alternatives - we show advantages and drawbacks of some schema choices. A method for transforming an XML schema into XNF is proposed in Section 7. First, for the XML schema the conceptual model in a form of ER schema is created. This schema and functional dependencies among its attributes are the basis for creating an initial XML schema satisfying the necessary condition formulated in Theorem 7.3. This schema is the subject for further normalization. Section 8 concludes the paper. 2 Redundancies in XML data -motivation example Our primary goal is to devise methods which will allow checking correctness of XML data and designing its expected (normalized) form. We expect such data to be devoid of the redundancies and immune to the update anomalies. In the case of XML schemas some redundancy problems may occur because of bad design of hierarchical structure of XML document. On the other hand, the hierarchical structure of this data can sometimes help conduct the normalization of XML data. Example 2.1. Let us consider an XML schema tree (Figure 1) that describes a fragment of a database for storing data about parts and suppliers offering these parts. Its DTD specification is given in Figure 2, and an instance of this schema is depicted in Figure 3. Each part element has identifier pId. One part may be offered by zero or more suppliers. Offers are stored in offer elements. Each offer has: offer identifier old, supplier identifier sld, price price, delivery time delivTime, and delivery place delivPlace. We assume that the following constraints must be satisfied by any instance of this schema: TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 419 S,: pId db I art* I supplier* I ^___^^ oId sId price delivTime delivPlace Further on we will show that a special caution should be paid to such kind of dependencies as these in which participates delivPlace. In this case we have to do with "cyclic" dependencies, i.e. delivPlace depends on pId and sId (pId, sId ^ delivPlace) and sId depends on delivPlace (delivPlace ^ sId). Figure 1: Sample XML schema tree db ^ part* part ^ pId supplier* supplier ^ offer+ offer ^ old sId price delivTime delivPlace Figure 2: DTD productions describing the XML schema in Figure 1 supplier offer offer offer offer old: ol o2 o3 o4 sId: si sl s2 sl price: xl x2 x3 x4 delivTime: tl t2 t3 t4 delivPlace dl dl d2 dl Figure 3: Sample instance of schema S 3 Redundancies and dependencies in relational databases 3.1 Relational schemas and functional dependencies In relational data model, a relational schema is understood as a pair R = (U, F), where U is a finite set of attributes, and F is a set of functional dependencies over F. A functional dependence (FD) as an expression of he form X ^ Y,, where X,Y C U are subsets of U .If Y C X, then X ^ Y is a trivial FD. By F + we denote all dependencies which can be infer from F using, for example, Armstrong's axioms [21, ?]. A relation of type U is a finite set of tuples of type U. Let U = [A1, ...,An} and dom(A) be the domain of attribute A e U. Then a tuple [Ai : a1, ..,An : an], where ai e dom(Ai), is a tuple of type U. A relation R conforms to a schema R = (U, F) (is an instance of this schema) if R is of type U, and all dependencies from F + are satisfied by R. An FD X ^ Y is satisfied by R, denoted R = X ^ Y, if for each tuples r1,r2 e R holds - all offer children of the same supplier must have the same values of sId; this is similar to relational functional dependencies, but now we refer to both the values (text value of sId), and to structure (children of the same supplier). - delivPlace functionally depends on part (pId) and supplier (sId), i.e. when a supplier has two different offers for the same part (possibly with different delivTime and/or price) the delivPlace is the same - see offers o1 and o2 in Figure 3. - delivPlace functionally determines supplier (sId). It means that having a delivery place (delivPlace) we exactly now which supplier is associated to this place; although one supplier can own many delivery places. For example, in Figure 3 d1 is delivery place uniquely associated to the supplier s1. It is easily seen that schema in Figure 1 leads to redundancy: sid (and also all other data describing suppliers such as e.g.: name and address) and delivPlace are stored multiple times for a supplier. nx(ri) = nx(r2) ^ ny(ri) = ny(r2), where nx (r ) is the projection of tuple r on the set X of attributes. A key in R = (U,F) is such a minimal set K of attributes that K ^ U is in F+. Then each A e K is called a prime attribute. 3.2 Normalization of relational schemas The main task in relational schema normalization is producing such a set of schemas that posses the required form, usually 3NF or BCNF. The normalization process consists in decomposition of a given input schema. The other approach consists in synthesizing 3NF from functional dependencies [22]. Ideally, a decomposition of a schema should be lossless, i.e. should preserve data and dependencies. Let R = (U, F), U1,U2 C U, and U1U U2 = U, then schemas R1 = (U1,F1) and R2 = (U2, F2) are a lossless decomposition of R = (U, F), iff: 420 Informatica 33 (2009) 417-430 T. Pankowski et al. - The decomposition preserves data, i.e. for each instance R of R the natural join of projections of R on U1 and U2 produces the relation equal to R, i.e. R = nUl (R) (R). - The decomposition preserves dependencies, i.e. F+ = (Fi U F2)+, where Fi = {X ^ Y | X ^ Y e FAX UY C Ui}, and similarly for F2. The decomposition ((U1,F1), (U2,F2)) of (U,F) preserves data, if U1 n U2 ^ U1 e F + (or, symmetrically, Ui n U2 ^ U2 e F +) [23]. Then we say that the decomposition is determined by the functional dependence U1 n U2 ^ U1 e F+. A schema R = (U, F) is in 3NF if for every FD X ^ A e F+, holds: - X is a superkey, i.e. a key is a part of X, or - A is prime. The second condition says that only prime attributes may be functionally dependent on a set of attributes which is not a key. A schema is in BCNF if only the first condition of the two above is allowed. It means, that if whenever a set X determines functionally an attribute A, then X is a superkey, i.e. determines the whole set U. The aim of a normalization process is to develop normal forms by analyzing functional dependencies and successive decomposition of the input relational schema into its projections. In this way a well-designed schema can be obtained, where unnecessary redundancies and update anomalies had been eliminated. In practice, 3NF is accepted as the most desirable form of relational schemas It does not eliminate all redundancies but guaranties dependency preservation. On contrast, BCNF eliminates all redundancies but does not preserve all dependencies. In [6] it was shown that 3NF has the least amount of redundancy among all dependency preserving normal forms. The research adopts a recently proposed information-theoretic framework for reasoning about database designs [14]. 3.3 Relational analysis of XML schema Let us consider the relational representation of the data structure presented in Figure 1. Then we have the following relational schema: R =(U,F), where U = {old, sId,pId,price,delivTime,delivPlace}, F = {old ^ sId,pId,price, delivTime, delivPlace, sId,pId ^ delivPlace, delivPlace ^ sId}. In R there is only one key. The key consists of one attribute oId since all attributes in U functionally depends on oId. Thus, R is in 2NF and oId is the only prime (key) attribute in R. Additionally, we assume that a given supplier delivers a given part exactly to one place (pId, sId ^ delivPlace). Moreover, delivery place delivPlace is connected with only one supplier (delivPlace ^ sId). R is not in 3NF because of the functional dependency sId, pId ^ delivPlace: - sId, pId is not a superkey, and - delivPlace is not a prime attribute in U. Similarly for delivPlace ^ sId. In this case the lack of 3NF is the source of redundancies and update anomalies. Indeed, for example, the value of delivPlace will be repeated as many times as many different tuples with the same value of the pair (sId, pId) exist in the instance of R. To eliminate this drawback, we can decompose R into two relational schemas, R1 and R2, which are in 3NF. The decomposition must be based on the dependency sId, pId ^ delivPlace which guarantees that the decomposition preserves data. In the result we obtain: R1 = (U1 ,F1), where U1 = {oId, sId,pId,price, delivTime}, F1 = {oId ^ sId,pId,price, delivTime}. R2 = (U2,F2), where U2 = {sId, pId, delivPlace}, F2 = {sId,pId ^ delivPlace, delivPlace ^ sId}. The discussed decomposition is both data and dependencies preserving, since: R(U)= nui (R) (R), for every instance R of schema R, and F = (F1 U F2)+. However, we see that R2 is not in BCNF, since delivPlace is not a superkey in R2. The lack of BCNF in R2 is the reason of redundancies. For example, in table R2 we have as many duplicates of sId as many tuples with the same value of delivPlace exist in this table. R2 sId pId delivPlace si pi di si p2 di si p3 d2 s2 pi d3 We can further decompose R2 into BCNF schemas R21 and R22, taking delivPlace ^ sId as the base for the decomposition. Then we obtain: TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 421 R21 = (U2\, F2i), where U21 = {delivPlace, sid}, F21 = {delivPlace ^ sId} R22 = (U22, F22), where U22 = {pId, delivPlace}, F22 = After applying this decomposition to R2 we obtain tables R21 and R22: R21 R22 sId delivPlace pId delivPlace si di pi di si d2 p2 di s2 d3 p3 d2 pi d3 This decomposition is information preserving, i.e. R2 = R21 ^^ R22, but does not preserve functional dependencies, i.e. F2 = (F21 U F22)+ = F21. We can observe some negative consequences of the loss of functional dependencies in the result decomposition. Assume that we insert the tuple (pi, d2) into R22. The tuple will be inserted because it does not violate any constrain imposed on R22. However, taking into account table R21 we see that supplier si (determined by d2 in force of delivPlace ^ sId) offers part pi in the place di. Thus, the considered insertion violates functional dependency sId,pId ^ delivPlace defined in R2. The considered example shows that in the case of relational databases we are not able to completely eliminate redundancies and also preserve all functional dependencies. It turns out ([6]) that the best form for relation schema is 3NF, although some redundancies in tables having this form can still remain. In next section we will show that the hierarchical structure of XML documents can be used to overcome some of the limitations of relational normal forms [24]. As it was shown in [5], there are decompositions of XML schemas that are both information and dependency preserving. In particular, we can obtain a form of XML schema that is equivalent to BCNF, i.e. eliminates all redundancies, and additionally preserves all XML functional dependencies. 4 XML schemas and instances Schemas for XML data are usually specified by DTD or XSD [25, 26]. In this paper an XML schema (a schema for short) will be specified by means of a slightly simplified version of DTD. We will assume that both attributes and elements of type #pcdata will be represented by so called terminal elements labeled with terminal labels and having text values. Additionally, terminal elements may have only one occurrence within children of one non-terminal element. Definition 4.1. Let L be a set of labels, Ter c L be a set of terminal labels and r e L — Ter be a root label. Let p be a set of productions of the form: l ^ a, where: - l e L — Ter; - r does not occur on the right-hand side of any production; - a is a regular expression over L — {r} defined asfol-lows: a ::= 3 \ y 3 ::= l \ 3? \ 3 *\ 3+ Y ::= A \ A? \ 3 \ Y Y \ y\y l e L — Ter — {r}, A e Ter. Then the quadruple S = (r, L, Ter, p), is an XML schema. XML schemas will be often represented as XML schema trees (Figure 1) (or as XML schema graphs when the schema is recursive). In this paper we restrict ourselves to non-recursive schemas. An example of XML schema, corresponding to the schema tree in Figure 1, was given in Figure 2, where: db is the root, part, supplier, and offer are non-terminal labels, andpid, oid, sid, price, delivTime, delivPlace are terminal labels. The following notion of tree patterns [19] will be useful to define tree tuples, tree formulas and XML functional dependencies. Definition 4.2. Let L be a set of labels, and Ter c L be its subset of terminal labels. A tree pattern is an expression conforming to the syntax: e ::= A \ l \ l/e \ l[e1, ...,ek ] \ l[ei, ..., ek\/e, where A e Ter, l e L — Ter, n < 1. To denote that a tree pattern $ is build using the set {Ai,...,An} of terminal labels, we will write $(Ai,...,An). Definition 4.3. Let $(A\,..., An) be a tree pattern, and x\,... ,xn be text-valued variables. Then the expression $(Ai : xi,..., An : Xn), is a tree formula. 422 Informatica 33 (2009) 417-430 T. Pankowski et al. Definition 4.4. Let ^(Ai : xi,... ,An : xn) be a tree formula, and w be a variable valuation, i.e. a function w : jxi,. .., xn} ^ Str U where Str is a set of text values, and ± is a distinguished null value. Then t = 4>{Ai : xi,...,An : xn)(w) = 0(Ai : w(xi), ...,An : w(xn)) is called a tree tuple of type Ai,... ,An) with values (Ai : w(xi), ...,An : w(xn). Without loss of generality, a tree tuple of type ..., An) will be denoted also as: t = Ai : ai,...,An : an), t = $(Ai,...,An)(w), t = j>(w(Ai),...,w(An)), t = $(Ai : t.Ai,...,An : t.An). Definition 4.5. Let t be a tree tuple of type ), and let U' = {Ai,..., An} C U. The projection of t on U', denoted nu'(t), is the set of fields (Ai : t.Ai) occurring in t, i.e.: nu'(t) = {Ai : t.Ai ,..., An : t.An}. We assume that there is a function type(t) that for a tree tuple t returns its type, it will be denoted by type(t) = ^(Ai,...,An). An XML database consists of a set of XML data, and an XML data is an instance of an XML schema. It will be convenient to distinguish between two kinds of instances: - an instance as a set of tree tuples, referred to as a canonical instance, - an instance as a labeled tree, referred as an XML tree. For one canonical instance there may be a number of XML trees representing it. In the set of all such XML trees we will be interested in such that have a required form, so called XML normal form (XNF). The canonical instance as well as all corresponding XML trees must conform to the given XML schema. Definition 4.6. The set of tree tuples D = {ti,...,tN }, is a canonical instance of an XML schema S = (r, L, Ter, p) if the type, type(t), of any tuple t £ D conforms to S. The conformance of a tree tuple type to an XML schema is defined as follows. Definition 4.7. Let Ai,..., An) be the type of a tree tuple t. This tree pattern conforms to the XML schema S = (r, L, Ter, p), if: 1. 4'(.Ai,..., An) = r[e], and 2. e conforms to p according to r. The conformance of a pattern e to the set of productions p according to a label l, is defined as follows: - if e is Ai, then Ai G Ter, and there is such the production l ^ a in p that Ai occurs in a; - if e is l je', then there is such the production l ^ a in p that l' occurs in a, and e' conforms to p according to l'; - if e is l'[ei,..., en], then there is such the production l ^ a in p that l' occurs in a, and every pattern ei, 1 < i < n, conforms to p according to l'. - if e is l'[e\,..., en]je', then l'[e\,... ,en] must conform to p according to l, and e' must conform to p according to l'. We define XML tree as an ordered rooted node-labeled tree over a set L of labels, and a set Str U {±}, which elements are used as values of terminal nodes. Definition 4.8. An XML tree I is a tuple (root, Ne, N1, child, X, v), where: - root is a distinguished root node, Ne is a finite set of non-terminal element nodes, and N* is a finite set of terminal nodes; - child C ({r} U Ne) x (Ne U N*) - a relation introducing tree structure into the set {r}UNeUNwhere r is the root, each non-terminal node has at least one child, terminal nodes are leaves; - X : {root} U Ne U N* ^ L - a function labeling nodes with labels; - v : N * ^ Str U {±} - a function assigning terminal nodes with values. Definition 4.9. We say that an XML tree I = (root, Ne, N*, child, X, v) conforms to an XML schema S = (r, L, Ter, p), denoted I = S, if - X(root) = r; if n G Ne, then X(n) G L — Ter; if n G Nthen X(n) G Ter; - if X(n) = l, l ^ a G p, and n\,..., nk are children of n, then the sequence X(n\) ■ ■ ■ X(nk) is a word in the language generated by a. It will be useful to perceive an XML canonical instance D with tuples of type $ as a pair (^, Q) (called a description), where Q is a set of valuations for Example 4.1. For the instance I\ in Figure 3 we have: ($(pId, oId, sId,price, delivTime, delivPlace) {(p1, o1, s1, x1, t1, d1), (p1, o2, s1, x2, t2, d1), (p1, o3, s2, x3, t3, d2), (p2, o4, s1, x4, t4, d1)}). An XML tree I satisfies a description ($, Q), if the root of I satisfies $ for every valuation w g Q, where: TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 423 1. (I, root) = (r[e],w), iff 3n £ Ne child(r,n) A (I,n) = (e,w); 2. (I, n) = (A, w), iff X(n) = A and v(n) = w(A). 3. (I,n) = (l/e,w), iff X(n) = l and 3n' £ Ne child(n, n') A (I, n') = (e, w) 4. (I, n) = (l[e\,ek], w), iff X(n) = l and for each i, 1 < i < k, exists ni such that child(n, ni) A (I, ni) = (ei,w)); 5. (I,n) = (l[eu..,ek]/e,w), iff (I,n) = (l[e\,.., ek],w) and exists n' such that child(n, n') A (I, n') = (e, w)). A description (0, Q) represents a class of 0 instances with the same set of values (the same Q), since elements in instance trees can be grouped and nested in different ways. □ B C C D D b cl c2 d1 d2 Figure 4: A simple XML tree For example, the XML tree in Figure 4 satisfies (among others) the following two descriptions (01: Qi), and (02, Q2), where: 01(B,C ) = /A[B,C], Qi = {(b,cl), (b, c2)}; 02(B,C,D) = /A[B,C,D], Q2 = {(b,cl,dl), (b,c2,d1)}. 5 XML functional dependencies and keys Over an XML schema we can define some constraints, which specify functional dependencies between values and/or nodes in instances of the schema. These constraints are called XML functional dependencies (XFD). Definition 5.1. A tree pattern of the form 0(A1,...,Ak )/A conforming to an XML schema S = (r, L, Ter, p) is an XML functional dependency (XFD) over S. Then we say that the set of paths terminating in (A1y... ,Ak) and satisfying 0, determines the path terminating in A and satisfying 0/A. Satisfaction of an XFD is defined against an canonical instance of XML schema. Definition 5.2. Let S be an XML schema, D be a canonical instance of S and f = 0(A1,... ,Ak )/A be an XFD on S. We say the D satisfies f, denoted D = f if for any two tree tuples t1,t2 £ D the following implication holds nAi,...,Ak (t1) = nAi,...,Ak (t2) ^ n A (t1) = n A (t2^). Assume that A1,...,Ak,A terminates, respectively, paths p1,...,pn,p. Then the XPath-oriented XFD 0(A1,.. .,Ak)/A corresponds to the following path-oriented XFD defined in [1]: P1.A1 ,...,pk .Ak ^ p.A. The XPath-oriented definition has the following advantages: - it is easy to check, using only the XPath semantics, whether an XML tree satisfies the XFD or not (see below), - this form of XFD can be used to generate an XQury program performing some transformation operations (see the next section). An XFD f can be interpreted as XPath expression, i.e. f(x1,...,xk) := 0(A1 : X1,...,Ak : Xk)/A, that for a given valuation w of its variables returns a sequence of objects (nodes or text values). An XML tree I = (S, Q) satisfies an XFD f (x) if for each valuation w £ Q of its variables, f (w(x)) evaluated on I returns an empty sequence or a singleton , i.e. count(f (w(x))(I)]) < 1, where [expr}(I) is the result of evaluation expr on the instance I. An XML tree I = (S, Q) satisfies an XFD f if for each valuation w £ Q, f (w) evaluated on I returns an empty sequence or a singleton, i.e. count([f(w)(I)]) < 1, where [expr}(I) is the result of evaluation expr on the instance I. Example 5.1. Over S1 the following XFDs can be defined: f1(oid) = /db/part[supplier/of fer/oId], f2(oid) = /db/part[supplier/of fer/oId]/pId, fs(pid, sid) = /db/part[pId]/supplier/ of fer [sId]/delivPlace, fi(deliv Place) = /db/part/supplier/of f er[ delivPlace]/sId, fs(pid, sid) = /db/part[pId]/supplier[offer/sId]. According to XPath semantics [27] the expression f1(oid) by a valuation w, is evaluated against the instance I1 (Figure 3) as follows: (1) first, a sequence of nodes of type /db/part is chosen; (2) next, for each selected node the predicate [supplier/offer/oId = w(oid)] is tested, this predicate is true in a node n, if there exists a path of type supplier/offer/oId in I1 leading from n to a text node with the value w(oid). We see that count([Lf1(w(oid))\) equals 1 for all four valuations satisfied by I1, i.e. for oid ^ o1, oid ^ o2, oid ^ o3, and oid ^ o4. 424 Informatica 33 (2009) 417-430 T. Pankowski et al. Similarly, execution of f2(oid)(I) gives a singleton for any valuation of oid. These singletons are text values of the path /db/part/pId, where only nodes satisfying the predicate [supplier/of f er/oId = w(oid)] are taken from the set of nodes determined by /db/part. So, also this XFD is satisfied by Ii. However, none of the following XFDs is satisfied in Ii. gi(sid) = /db/part[supplier/of fer/sId], g2(pid) = /db/part[pI d]/supplier/of f er/sI d g^(pid) = /db/part[pI d]/supplier/of f er g^(pid, sid) = /db/part[pI d]/supplier/of fer[sId], g5 (delivPlace) = /db/part/pId/supplier/ offer [delivPlace]. Evaluating the above XFDs against Ii, we obtain, for example. count(lgi(sid)([sid ^ s1])(Ii)]) = 2, count(l_g2(pid,)([pid, ^ p1])]) =2, count(l_g3(pid)([pid ^ p1])]) =3. An XFD can determine functional relationship between a tuple of text values of a given tuple of paths and a path denoting either a text value (e.g. f2(oid)) or a subtree (a node being the root of the subtree) (e.g. fi(oid)) Ü the latter XFDs will be referred to as XML keys. Definition 5.3. A functional dependence f = $(Ai,...,Ak), where f is of type q/l and l is a non-terminal label, is an XML key for subtrees of the type q/l. Another notation for XML keys has been proposed in [7]. In that notation (db.part, {pId}) is an absolute key saying that a subtree of type db/part is uniquely determined by the path db/part/pid (this constraint holds in the instance Ii in Figure 3). In our XPath-oriented notation this key is the following XFD. db/part[pid]. An example of a relative key ([7]) is (db.part, (supplier, {of fer.sId})), that says that that in the context of db/part, a tree supplier is determined by the path offer/sId. In our XPath-oriented notation this key is expressed as follows. db/part [pid]//supplier [sid]. 6 Normal form for XML Definition 6.1. Let S be an XML schema and £ be a set of XFDs defined over S. (S, £) is in XNF iff for any XFD 4>(AU ..., Ak)/A e (S, £)+, also ...,Ak) e (S, £)+,where (S, £)+ is a set of XFDs being consequences of (S, £). In other word, if ^(Ai, ...,Ak )/A is an XFD for q/l/A, then Ai,..., Ak) is a key for q/l. It means that there is at most one subtree of type q/l for any different valuation of (Ai,..., Ak) in if the value of a child A of q/l is determined by this valuation. In this way the redundancy, i.e. repetition of values in different subtrees of type q/l, is eliminated. Let us consider the schema S2 in Figure 5 and its instance I2 in Figure 6. S2: offer+ delivPlace old price delivTime Figure 5: Restructured form of schema in Figure 1 I,: offer offer I I old: o1 o2 price: x1 x2 delivTime: t1 t2 x3 t3 x4 t4 Figure 6: Instance of schema in Figure 5 The following XFD over S2 /db/part/supplier[deliv Place]/sId says that delivery place (delivPlace) determines the supplier (sId). However, S2 is not in XNF, since its instance I2 does not satisfy the key /db/part/supplier [delivPlace]. To eliminate redundancies in XML documents, some normal forms (XNF) for XML schemas have been proposed [1, 24, 4, 5]. In this paper we will define XNF in the spirit of BCNF defined for relational schemas. This approach was also used in [1]. It means that I2 (Figure 6) is not free of redundancy (there are two different subtrees of type supplier describing the same supplier, i.e. its possible name, address, etc.). In the case of schema S3 (Figure 7) the corresponding XFD and the key are: TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 425 pId offer+ delivPlace old price delivTime Figure 7: Restructured fomi 0 schema in Figure 1 L: Rb offer offer I I oId: o1 o2 price: x1 x2 delivTime: t1 t2 x4 t4 x3 t3 Figure 8: Instance of schema in Figure 7 db/supplier[part/delivPlace]/sId db/supplier[part/deliv Place]. These constraints are satisfied by I3 (Figure 8). We see that this time, there is only one subtree of type supplier for any value of delivPlace. The other dependency of interest is sId, pId ^ delivPlace. Its specification with respect to S2 and S3 is as follows: db/part [pId]/supplier [sId] /delivPlace, and In Figure 9 there is schema S4 that is in XNF. To make the example more illustrative, we added node name to part data. Also the instance in Figure 10 was slightly extended as compared to instances I2 and I3. db/supplier[sId]/part[pId\/delivPlace. It is easy to see then if these XFDs hold, then also keys db/part [pId]/supplier [sId], and db/supplier [sId] /part [pId] are satisfied. However, neither S2 nor S3 is in XNF. We have already shown that there is redundancy in instances of S2. Similarly, we see that also in instances of S3 redundancies may occur. Indeed, since one part may be delivered by many suppliers then the description of one part may be multiplied under each supplier delivering this part, so such data as part name, type, manufacturer etc. will be stored many times. It results from the fact that satisfaction of db/supplier/part[pid\/pname would not imply the satisfaction of db/supplier/part[pid]. suppliers I supplier* I sId part* / \ pId delivPlace parts offers I I ^jpart* offer* pId name old sId pId price delivTime Figure 9: XNF schema for schémas S1, S2, and S3 suppliers db I parts suppliers^ offers supplier sId s2 delivPlace delivPlace delivPlace di d1 d3 parts part part / \ / \ pId name pId name pi n1 p2 n2 offers part pId p1 delivPlace d2 part / \ pId name p3 n3 offer offer offer offer oId: oi o2 o3 o4 sId: s1 si s2 si pId: p1 Pi Pi p2 price: xi x2 x3 x4 delivTime: ti t2 t3 t4 Figure 10: Instance of schema in Figure 9 XSD (XML Schema Definition) for £4 in notation proposed in [26] is shown in Figure 11. Note that we cannot use DTD since there are two subtrees labeled part, where each of them has different type: the part subtree under supplier consists of pId and delivPlace, whereas the part subtree under parts consists of pId and name. Recall that in the case of DTD each non-terminal symbol (label) can have only one type (definition), i.e. can appear on the left-hand side of exactly one production rule [26]. This difficulty might be overcame by introducing new labels (for example partDesc) for full descriptions of a parts. It can be shown that £4 satisfies the condition of XNF. Thus, this schema is both redundant-free and dependency S 4 I 4 426 Informatica 33 (2009) 417-430 T. Pankowski et al. db db[content] content suppliers[suppliers], parts[parts], offers[of fers] suppliers supplier[supplier]* parts partsparti]* of fers offers[off er]* supplier sId, part[part2] part2 pId, delivPlace parti pId, name offer oId, sId, pId, price, delivTime Figure 11: An XSD describing the XML schema in Figure 9 preserving. However, as we will show in the next section, schema S4 is not the best XNF for the considered running example. XML normal form (XNF). Our method is based on analyzing functional dependencies among attributes of entities involved in the ER schema. In Figure 12 there is an ER schema corresponding to the XML schema considered in previous sections (see S1 in Figure 1) with two additional attributes sname (supplier name) and pname (part name). delivPlace is treated as an entity with the key attribute did. 7 Transforming XML schemas to XML normal form In the previous section we discussed an example of transforming an XML schema into XNF. We started with the schema Si in Figure 1, and the final schema was S4 in Figure 9. However, the final schema has been created in a rather intuitive way. Thus, although it is in XNF it is not clear whether there exists another XNF reformulation of S1, maybe better than X4, or not. A systematic algorithm (the Decomposition Algorithm, DA) for normalizing an XML schema was proposed in [1]. If we apply DA to S1, we obtain a schema that is worse than S4. It is so due to the following reasons: 1. In DA we always obtain a normal form relative to a context path, i.e. the XNF is restricted to the subtree determined by this context path. Only if the context path is equal to the root label (db in our case), the XNF is absolute. 2. In DA it is not possible to change ordering of elements on a path, because we can only move and discard attributes or create new element types [1]. For example, the part element precedes supplier (is above it) in S1 and in the result schema this ordering must remain unchanged. But then the absolute XNF for S1 cannot be achieved. 3. The result of normalization strongly depends on the starting XML schema. 7.1 Transforming ER model to XNF In this section we will discuss how to transform an ER (Entity-Relationship) schema [28] into XML schema in Figure 12: ER schema corresponding to S1 The following functional dependencies are captured by the ER schema in Figure 12: oid ^ sid,pid, did sid,pid ^ did did ^ sid sid ^ .sname pid ^ pname oid ^ price, delivTime (1) The first three of them are of particular importance because they state constraints between entities (key attributes of entities). We will proceed in two steps: 1. An initial XML schema is created using the entities names from the ER schema, their attributes and functional dependencies between key attributes. The initial schema must follow the necessary condition formulated in Theorem 7.3. Satisfaction of this condition is the prerequisite to obtain an XML schema in absolute XNF in the next step. 2. The DA algorithm [1] is applied to obtain the final XNF schema. In fact, only the step called Creating New Element Types from this algorithm is to be applied. In this way all violations caused by functional dependencies over non-key attributes (appearing on the right-hand sides of these dependencies) are resolved. Such functional dependencies in our example are for example the three last in (1). Definition 7.1. Let ...,An)/B be an XFD of type q/l/B over an XML schema S. S is called XNF-consistent with ^(A1;..., An)/B, if for any instance I of S holds the implication: mergeq/i(I) = . ,An), TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 427 where mergeq/i (I) is the result of merging subtrees of type q/l in I. Let I be an XML tree and T = (r, t) be a subtree of type q/l of I, i.e. T belongs to the result of evaluation of the path expression q/l on the instance I, T e [q/l(I)]. Then r is a list of the form r := (Ai : ai:... ,An : an), and t is a list of subtrees (each of these lists may be empty). Definition 7.2. Two subtrees Ti = (ri,Ti), and T2 = (r2,T2) of type q/l of an instance I are joinable if ri = r2, and then: - (r,Ti) ix (r,T2) = (r,Ti\\T2), where t1\\t2 is concatenation of lists t1 and t2; - the merge operation on all subtrees of type q/l of an instance I is defined as follows: mergeq/l (I) = for each TiT e [q/l(I)] if Ti and T2 are joinable then I := I - q/l[Tl] - q/l[T2] U q/l[Ti H T2]. Example 7.1. Let S be defined by l — li* 11 — Ai B l2* 12 - A2 C and let terminal labels Ai,A2 functionally determine B, i.e. Ai, A2 — B. Then the corresponding XFD is $(Ai, A2)/B := l/h[Ai,l2[A2]]/B. We see that I (Figure 13) satisfies ^(Ai,A2)/B, i.e. for the values of the pair of paths (l/li/Ai,l/li/l2/A2) the value of the path l/h/B is uniquely determined. Similarly, the merged form of I, mergel/ll (I), satisfies A1,A2), i.e. the pair (l/li/Ai, l/li/l2/A2) of path values uniquely determines the node of type l/li. Note that I in its un-mergedform does not satisfy ^(Ai, A2), because there are two nodes of type l/li corresponding to the same pair of path values of the type (l/li/Ai,l/li/l2/A2). Theorem 7.3. Let f := $(Ai; ...,An)/B be an XFD of type q/l/B over an XML schema S. Then S is XNF-consistent with f if the set consisting of labels of all terminal children of q/l is functionally dependent on the set of terminal labels occurring in f. Proof. We will proof the theorem by contradiction. Let us assume that there exists a terminal child of l labeled C and C is not functionally dependent on the set [Ai,..., An} of terminal labels occurring in f. Then the tree of the form I = Ai : ai,..., An : an) U[q/l[B : b,C : ci],q/l[B : b,C : C2]}, is a subtree of an instance of S in which ... ,An)/B is satisfied. However, this subtree violates &(Ai,... ,An). This violation follows from the fact that there are two subtrees rooted in q/l, one with terminal children (B : b, C : ci), and the other with terminal children (B : b,C : c2 ), so this subtrees cannot be merged. Thus S is not XDF-consistent with f. □ In the schema in the following example the necessary condition formulated in Theorem 7.3 does not hold. Example 7.2. Let S be defined by l — li * 11 — Ai l2* 12 - A2 B C and let terminal labels Ai,A2 functionally determine B, i.e. Ai,A2 — B. Then the corresponding XFD is t(Ai,A2)/B := l/li[Ai]/l2[A2]/B. Assume that C does not depend on [Ai,A2}. Schema S is not XNF-consistent with $(Ai,A2)/B, since we have (see Figure 14): I = l/li(Ai : ai)/l2(A2 : a2) U[l/li/l2((B : b,C : ci), (B : b,C : <%))} = mergei/ii/i2 (I) Thus, I violates Ai,A2) before and after application of the merging operation, in both cases there are two nodes of type l/li /l2 corresponding to the same pair of values (ai,a2) of the pair of paths (l/li/Ai,l/li/l2/A2). I: A /l\ 4 B 12 a b /\ a2 c 11 /l\ 4 B 12 a b /\ a2 c mergemi{iy. 1 I /1 4 B 12 a b /\ 12 /\ a^ c a^ c Figure 13: Instance I of the schema from Example 7.1 and its merged form The following theorem formulates the necessary condition for XML schema to be XNF-consistent with an XDF defined over this schema. 4 4 A B b C A B b C Figure 14: Instance of the schema from Example 7.2 that satisfies ^(Ai,A2)/B and violates $(Ai,A2) Now, using the Theorem 7.3 and DA, the transformation of ER into XNF is realized in the following two steps: 1. We start with functional dependencies defined over key attributes of entities modeled by ER schema. Following the requirements of Theorem 7.3 we obtain the i 2 2 a 2 2 a c a c a c, a c 2 2 2 2 2 2 428 Informatica 33 (2009) 417-430 T. Pankowski et al. following DTD for ER schema in Figure 12: db — supplier* .supplier — sid sname part* part — pid pname did of fer* offer — oid price delivTime (2) Now, the remaining functional dependencies specified in (1) must be taken into account. 2. The DTD obtained in the first step is the subject of the decomposition by means of the DA algorithm. It is easily seen that the XFD corresponding to pid — pname is anomalous. The step Creating new element types of DA converts (2) into db — supplier * partDesc* supplier — sid sname part* partDesc — pid pname (3) part — pid did offer* offer — oid price delivTime Applying the above steps to the ER schema from Figure 12 gives the XML schema in XNF depicted in Figure 15 and its instance in Figure 16. S5: db supplier sid sname part pname pid did offer* oid price delivTime Figure 15: The result of transformation ER schema from Figure 12 to XML schema in XNF 8 Conclusion In this paper, we discussed how the concept of database normalization can be used in the case of XML data. Normalization is commonly used to develop a relational schema free of unnecessary redundancies and preserving all data dependencies existing in application domain. In order to apply this approach to design XML schemas, we introduced a language for expressing XML functional dependencies. In fact, this language is a class of XPath expressions, so its syntax and semantics are defined precisely. We define the notion of satisfaction of XML functional dependence by an XML tree. To define XNF we use the approach proposed in [24]. All considerations are illustrated by the running example. We discuss various issues connected with normalization and compare them with issues faced in the case of relational databases. We show how to develop redundancy-free and dependency preserving XML schema. It is worth mentioning that the relational version of the schema cannot be structured in redundancy-free and dependency preserving form. In this case, preservation of all dependencies requires 3NF but then some redundancy is present. Further normalization to BCNF eliminates redundancies but does not preserve dependencies. In the case of XML, thanks to its hierarchical nature, we can achieve both properties. However, it is not clear if this is true in all cases (see e.g. [5]. We have proposed a method for normalizing XML data in to steps. First, we build a conceptual model by means of ER schema and specify all functional dependencies among its attributes. Following the necessary condition formulated in Theorem 7.3, an initial XML schema is created. This schema is XNF-consistent with all XML functional dependencies under consideration. Such the schema can be further normalized, for example using the decomposition algorithm (DA) [1]. It was shown that in the presence of cyclic functional dependencies the procedure proposed in DA results in bad design (only a local XNF can be obtained, i.e. only subschemas of the schema can be transformed into XNF). Acknowledgement: This work was supported in part by the Polish Ministry of Science and Higher Education under grant NN516 369536. References [1] M. Arenas and L. Libkin, "A normal form for XML documents." ACM Trans. Database Syst., vol. 29, pp. 195-232, 2004. [2] E. Codd, "Further normalization of the data base relational model," R. Rustin (ed.): Database Systems, Prentice Hall, and IBM Research Report RJ 909, pp. 33-64, 1972. [3] E. F. Codd, "Recent investigations in relational data base systems," in IFIP Congress, 1974, pp. 10171021. [4] S. Kolahi, "Dependency-Preserving Normalization of Relational and XML Data," Database Programming Languages, DBPL 2005, Lecture Notes in Computer Science, vol. 3774, pp. 247-261, 2005. [5] -, "Dependency-preserving normalization of relational and XML data," Journal of Computer and System Sciences, vol. 73, no. 4, pp. 636-647, 2007. [6] S. Kolahi and L. Libkin, "On redundancy vs dependency preservation in normalization: an information-theoretic study of 3NF," in PODS '06. ACM, New York, USA, 2006, pp. 114-123. [7] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan, "Reasoning about keys for XML," Information Systems, vol. 28, no. 8, pp. 1037-1063, 2003. TRANSFORMATION OF XML DATA INTO. Informatica 33 (2009) 417-430 429 supplier sid sname part si snL pid did pi di offer part oid price delivTime , o2 x2 t2 piddld p2 di pid pi sid sname part s2 sn2 pid did offer x pi d2/\ \ offer oid price delivTime / I \ o3 x3 t3 partDesc / \ pname pid pnl p2 pname pn2 oid price delivTime ol xl tl oid price delivTime o4 x4 t4 Figure 16: Instance of S5 from Figure 15 [8] W. Fan and J. Sim6on, "Integrity constraints for XML," J. Comput. Syst. Sci., vol. 66, no. 1, pp. 254291,2003. [9] M. W. Vincent, J. Liu, and M. K. Mohania, "On the equivalence between FDs in XML and FDs in relations," Acta Inf., vol. 44, no. 3-4, pp. 207-247, 2007. [10] M. W. Vincent and J. Liu, "Checking functional dependency satisfaction in XML," Database and XML Technologies - XSym 2005, Lecture Notes in Computer Science, vol. 3671, pp. 4-17, 2005. [11] C. Yu and H. V. Jagadish, "XML schema refinement through redundancy detection and normalization," VLDB Journal, vol. 17, no. 2, pp. 203-223, 2008. [12] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems. Redwood City: The Benjam-in/Cummings, 1994. [13] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases. Reading, Massachusetts: Addison-Wesley, 1995. [14] M. Arenas and L. Libkin, "An information-theoretic approach to normal forms for relational and XML data," J. ACM, vol. 52, no. 2, pp. 246-283, 2005. [15] P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner, "Pathfinder: XQuery -the relational way," in VLDB 2005. ACM, 2005, pp. 1322-1325. [16] S. Pal, I. Cseri, O. Seeliger, M. Rys, G. Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D. Churin, and E. Kogan, "XQuery implementation in a relational database system," in VLDB 2005. ACM, 2005, pp. 1175-1186. [17] P. Pigozzo and E. Quintarelli, "An algorithm for generating XML Schemas from ER Schemas," in SEBD, 2005, pp. 192-199. [18] R. Schroeder and R. dos Santos Mello, "Improving query performance on XML documents: a workload-driven design approach," in DocEng. ACM, 2008, pp. 177-186. [19] W. Xu and Z. M. Özsoyoglu, "Rewriting XPath queries using materialized views," in VLDB 2005, 2005, pp. 121-132. [20] T. Pankowski, "XML data integration in SixP2P -a theoretical framework," in EDBT Workshop Data Management in P2P Systems (DAMAP 2008), ACM Digital Library, 2008, pp. 11-18. [21] W. W. Armstrong, "Dependency structures of data base relationships," in IFIP Congress, 1974, pp. 580583. [22] P. A. Bernstein, "Synthesizing third normal form relations from functional dependencies," ACM Transactions on Database Systtems, vol. 1, no. 4, pp. 277298, 1976. [23] J. Rissanen, "Independent components of relations," ACM Transactions on Database Systems, vol. 2, no. 4, pp. 317-325, 1977. [24] M. Arenas, "Normalization theory for XML," SIG-MOD Record, vol. 35, no. 4, pp. 57-64, 2006. [25] XML Schema Definition Language (XSDL) 1.1, 2007, www.w3.org/TR/xmlschema11-1. [26] W. Martens, F. Neven, and T. Schwentick, "Simple off the shelf abstractions for XML schema," SIGMOD Record, vol. 36, no. 3, pp. 15-22, 2007. 430 Informatica 33 (2009) 417-430 T. Pankowski et al. [27] XML Path Language (XPath) 2.0, 2006, www.w3.org/TR/xpath20. [28] P. P. Chen, "The entity-relationship model - toward a unified view of data," ACM Transactions on Database Systtems, vol. 1, no. 1, pp. 9-36, 1976.