Informatica 41 (2017) 305–315 305 Accelerating XML Query Processing on Views Yin-Fu Huang and Yu-Hsien Cho National Yunlin University of Science and Technology 123 University Road, Section 3, Touliu, Yunlin, Taiwan 640, R.O.C. E-mail: huangyf@yuntech.edu.tw, http://mdb.csie.yuntech.edu.tw Keywords: XPath, labeling schemes, materialized views, T-Bitmap, tag index, value index, navigation, twig query Received: April 13, 2016 With the widespread use of the eXtensible Markup Language (XML), more and more applications store and query XML documents in XML database systems. Thus, how to efficiently process a query and find the specified patterns conforming the query from XML documents is a crucial issue. In this paper, some processing methods are employed on XML documents to improve document retrieval. First, a materialized view is built from an original document for each query. Then, on each materialized view, auxiliary structures such as T-Bitmap and indexes are also built to further accelerate query processing. Finally, four experiments are conducted to show the superiority of the proposed approach. Povzetek: Predstavljena je metoda za hitrejše iskanje po bazah XML dokumentov. 1 Introduction Since XML (eXtensible Markup Language) was widely used to exchange data over the web, more and more applications store and query XML documents in XML database systems. Different from other data formats, an XML document is composed of elements and values with a nested structure, and could be modeled as a tree structure. XPath and XQuery are the standard XML query languages proposed by W3C. They can be used to describe patterns with specified predicates on multiple elements with tree structured relationships. However, how to efficiently process a query and find the specified patterns conforming the query from XML documents is a crucial issue. In the past, different methods have been proposed in querying XML documents. One of research directions was to build materialized views on XML documents. The goal is to reduce the number of visited nodes during tree traversing by searching from the root of a materialized view, rather than from the root of an original XML document tree. Another research direction was to construct index or access methods to query XML documents for facilitating query processing. In this paper, we integrate the methods from these two research directions as our motivation for accelerating XML query processing on views. The reason is that the performance of a materialized view is better than a non-materialized view because not only these data can be accessed without re-materialization, but also they can be fetched faster by building indexes on these data beforehand. Besides, a materialized view is usually used in accessing a large amount of data, such as data warehouse applications, in support of management’s decision-making process through OLAP queries, almost read operations In short, the motivation is for decision makers to accelerate XML query processing in a data warehouse. In summary, we highlight the contributions of this paper as follows: 1) In this study, we build materialized views from an XML document for each query to reduce the search space of queries, and also build auxiliary structures such as T-Bitmap and indexes to further accelerate query processing. 2) Comprehensive experiments are conducted to verify the superiority of the proposed approach. 3) The space vs. time issue is explored when multiple materialized views are integrated together to save the space. The remainder of this paper is organized as follows. Section 2 presents the previous work proposed in querying XML documents. In Section 3, basic concepts such as query processing and materialized views on XML documents are introduced. In Section 4, we propose a system architecture consisting of view processing and query processing. In Section 5, four experiments are conducted to show the superiority of our approach. Finally, we make conclusions in Section 6. 2 Previous work As mentioned in Section 1, one research direction on querying XML documents was to build materialized views on XML documents to reduce the number of visited nodes during tree traversing, thereby leading to faster query processing. Godfrey et al. [1], and Murthy and Banerjee [2] proposed SQL/XML syntax for query processing on views, whereas Halevy [3] and Jayavel et al. [4] proposed various query syntax such as join to handle views and focused on the problem of evaluating XML queries over XML views of relational data. However, users must be familiar with these various query syntax. Katsifodimos et al. [5] considered choosing the 306 Informatica 41 (2017) 305–315 Y.-F. Huang et al. best views to materialize within a given space budget to improve the performance of a query. Roantree and Liu [6] approach is to segment a materialized view into fragments to minimize the effect of view changes. Bonifati et al. [7] presented an algebraic approach for propagating source updates to materialized views. Wu et al. [8, 9] proposed a bitmapped materialized views approach for optimizing XML queries. Gosain et al. [10] provided a survey of materialized view evolution methods, which aims at studying the materialized view evolution in relational databases and data warehouses as well as in a distributed setting. Gosain and Sachdeva [11] drew several conclusions about the status quo of materialized view selection and a future outlook is predicted on bridging the large gaps that were found in the existing methods. Another research direction was to construct index or access methods to query XML documents, also improving query processing. Some studies investigated constructing index methods to query XML documents [12-15]. Bruno et al. [12] and Jiang et al. [13] used a structure join method to determine element relationships based on the numbering scheme. This method has good performances for an ancestor-descendant axis, but it might fetch useless nodes for a parent-child axis, because all descendant nodes must be accessed to check if they are real children. Therefore, Huang and Wang developed an efficient query processing algorithm for retrieving XML documents [14]. Hsu et al. also proposed a path clustering method based on the concept of summary indexes for the processing of both structural and content queries on XML documents [15]. Karthiga and Gunasekaran [16] used tree-based association rules to mine the semantics from XML documents, which provide information on both the structure and the content of XML documents. The mined knowledge is used to provide the quick answers to queries and an approach called path based indexing is used to improve the speed of data retrieval. Alghamdi et al. [17], and Thi Le et al. [18] proposed approaches to optimizing twig queries by utilizing the semantics/constraints defined in XML schemas. Furthermore, Ordonez focused on the optimization of linear recursive queries in SQL [19]. Subramaniam and Haw [20] proposed an XML labeling scheme that helps quick determination of structural relationship among XML nodes and supports dynamic updates without relabeling nodes in case of update occurrences. Belgamwar et al. [21] follows an upside down approach which explicitly stores the values and only reconstructs the internal nodes, if needed. As a solution, they proposed a compressed internal storage format for native XML database systems where the inner structure of the gathered documents is virtualized. Ferro and Silvello [22] introduced a new paradigm where traditional approaches based on traversing trees are replaced by a brand new one based on basic set operations which directly return the desired subtree, avoiding to create it. Tudor [23] proposed an optimization model for XML data processing based on a heuristic algorithm to extract data from XPath views. 3 Basic concepts 3.1 XML documents XML is a markup language which was proposed by W3C in 1996. The main purpose of the standard language is to provide data descriptions and data exchanges across different platforms. Like other markup languages, the contexts of XML are declared between start and end tags; however, especially different from others, the tags can be flexibly defined by users to describe data, and furthermore XML is supported in different platforms and systems. That is why it becomes the most common format for data exchanges. An XML document is with a nested structure, and it could be represented as a rooted, ordered, and labeled tree structure. Figure 1 and Figure 2 illustrate an XML document and its corresponding tree representation, respectively. In the document, there is a unique root element called “root” and one of the descendant elements, called “Book”, has seven child element nodes; i.e., Title, Chapter, Para, Author with an attribute node “Id”, Publisher, Name, Email, and their texts. The symbols as shown in Figure 2 are circles, rectangles, and triangles; they represent elements, texts, and attributes, respectively. How to know XML Introduction to XML Your First XML John XML tech John@hpdiy.zzn.com ... Figure 1: XML document. 3.2 XPath XPath (XML Path Language) is an expression language for addressing and querying an XML document. In XPath expressions, each step is separated by "/" and contains three components: axis, node test, and predicate. Axis defines the relationship to be followed in the document tree. Node test defines what kind of nodes Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 307 is required. Predicate is optional and provides the capability to filter nodes, according to selection criteria. Given an XPath example “//child::Publisher [child::Name='XML tech'] /child::Email” , it is to get the email of the publisher whose name is “XML tech”. When navigating the XML document, it must start from the root element “root”, then the descendant node “Publisher”. Beneath “Publisher”, we search the child nodes to find the node called “Email”. Besides, during the search, it must have a child node called “Name” whose text matches with the specified predicate “XML tech”. In general, the example above can be abbreviated to “//Publisher [Name ='XML tech'] /Email”. Figure 2: XML document tree. 3.3 Labeling schemes One of the major query searches is to determine the relationships between nodes. In order to determine element relationships quickly, several different labeling schemes have been proposed. O'Connor and Roantree categorized labeling schemes into containment schemes, prefix schemes and prime number schemes [24]. Here, labeling schemes are classified into prefix-based ones and region-based ones (or containment schemes). Dewey code [25] is a prefix-based labeling scheme that records the position information of a node, according to the path from the root to the node. For example, Dewey-id of node “Para” is 1.1.1.1.2.2, and indicates that we can get node “Para” if we search alone the path (the first node of level 1, the first node of level 2, the first node of level 3, the first node of level 4, the second node of level 5, the second node of level 6). Besides, since (1.1.1.1.2) is the prefix of (1.1.1.1.2.2), the relationship between node “Chapter” (1.1.1.1.2) and “Para” (1.1.1.1.2.2) can be deduced as a parent-child one. However, the drawback of the prefix-based labeling scheme is its lengthy Dewey codes, especially when the levels of an XML document tree are too deep. The region-based labeling scheme [12] is another numbering scheme. The label contains three elements (start, end, level) where the start value and end value forms a region. The region of an upper-level node (i.e., ancestor or parent) must cover those of lower-level nodes (i.e. children or descendants). In other words, if node A covers node B, then A.start < B.start and B.end < A.end. Besides, the level value represents the node level in a document tree. With the coverage information, we can determine the relationships between nodes quickly. As for the labeling, we can label each node by traversing an XML document tree in a depth-first search way. 3.4 XML document storage An XML documents can be stored in a few different forms, such as in flat files, in relational databases, and in native XML databases. For an XML document to be stored in flat files, we need to parse the files in advance before accessing them. Although it is the simplest form, the parsing time would be very lengthy when the XML document size is too large. Besides, it also incurs multi- user access and concurrency control problems. For an XML document to be stored in relational database, since the XML document is a tree structure, it must use some middleware to translate the XML format into relational tables. Besides, when querying the XML document, it is also necessary to translate a query into an SQL statement, and execute join operations repeatedly among different relation tables, so that it exposes lower efficiency. Native XML databases aim to provide complete XML document storage and manipulation. Different from other database systems, native XML databases use an XML document as a basic unit of storage, and defines an XML model used to store and retrieve XML documents. 3.5 Materialized views A view is a virtual and derived table defined by users for facilitating to express a complicated query. Rather than physically stored as parts of a database, a view definition is merely recorded by the database system. It is evaluated only when a user issues a query involving this view. However, a materialized view is the one which is physically stored in the database, in addition to its definition. Absolutely, the performance of a materialized view is better than a non-materialized view because not only these data can be accessed without re- materialization, but also they can be fetched faster by building indexes on these data beforehand. Thus, a materialized view is usually used in accessing a large amount of data, such as a data warehouse or in business intelligence applications, where we need to take more time to query them. A data warehouse is a subject- oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision- making process. It can be accessed by decision makers through OLAP queries, almost read operations. In short, our design is for decision makers to accelerate XML query processing in a data warehouse. In this paper, based on native XML databases, we use materialized views to query required data from an original document. Here, a materialized view can be defined using the “CREATE MATERIALIZED VIEW” 308 Informatica 41 (2017) 305–315 Y.-F. Huang et al. function and an XPath expression. For the materialized views on an original document, we build auxiliary files and construct indexes using numbering schemes to avoid unnecessary sub-tree traversal, thereby improving the navigation efficiency of a query. 4 System architecture 4.1 Overview In order to achieve faster query processing on the views defined in a native XML database, we propose a system architecture consisting of an offline phase and an online phase, as shown in Figure 3. In the offline phase called view processing, we build view-relevant structures such as T-Bitmap and indexes to accelerate later query processing. In the online phase called query processing, the system can promptly respond to view-based queries, utilizing the T-Bitmap and indexes built beforehand. Figure 3: System architecture. 4.2 View processing In this section, the motivation of view materialization is introduced first. Then, we build relevant structures such as T-Bitmap and indexes on materialized views to further accelerate query processing. 4.2.1 View pre-processing Usually, an Xpath expression is used to address and query an XML document. However, for the query execution, the system always searches an XML document tree from the root. When a query is frequently executed, the system performance would be degraded since a large amount of unnecessary sub-tree traversal cannot be avoided. For the query with an Xpath expression as shown in Figure 4, we can define a materialized view beforehand, which is rooted from node “Books” with an attribute “category” matching with the specified predicate “Technology”, as shown in Figure 5. Then, the materialized view can be created from the original document, as shown in Figure 6. Thus, rather than traversing the original document tree always from the root, the system only needs to search the materialized view, thereby improving the navigation efficiency of the query. Figure 4: Query with an XPath expression. Figure 5: View definition. How to know XML Introduction to XML Your First XML John XML tech John@hpdiy.zzn.com Small World Q&A The One Jimmy Network Jimmy@hpdiy.zzn.com ... Figure 6: Materialized view. Before building T-Bitmap and indexes on a materialized view to further accelerate query processing, we must determine the relationships between nodes (i.e., parent-child axes and ancestor-descendant axes) in a materialized view using the region-based labeling scheme as mentioned in Section 3.3. We traverse a materialized view and label nodes in a depth-first search way. When a node is visited first, its start value is created; when we leave the node, the end value is labeled. After traversing the whole materialized view, all the nodes in the view are completely labeled as shown in Figure 7. CREATE MATERIALIZED VIEW mv AS( SELECT extract(sys_nc_rowinfo$, '/root/store/Books[@category="Technology"]') FROM XMLTABLE); XPath:/root/store/Books[@category="Technology"] /Book[Title="How to know XML"]/Publisher /Email=‘John@hpdiy.zzn.com’ Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 309 Book(1,25) Title(2,4) How to know XML (3,3) Introduction to XML (6,6) Para(7,9) Your First XML (8,8) Chapter(5,10) Id(12,14) Q345 (13,13) John(15,15) Author(11,16) Publisher(17,24) Name(18,20) XML tech (19,19) John@hpdiy.zzn.com (22,22) Email(21,23) Books(0,251) … Figure 7: Labeling nodes in a depth-first search way. 4.2.2 Building T-Bitmap T-Bitmap is a bit string type, which is used to record what descendant nodes are beneath a current node. First, a dictionary recording the positions in T-Bitmap and the corresponding tags is created, as shown in Table 1. Then, the T-Bitmap value on each node can be calculated using OR operators. For an example as shown in Figure 8, the T-Bitmap of node “Publisher” can be calculated by combining the T-Bitmaps of “Name”, “Email”, and itself using OR operators. Table 1: Dictionary: positions in T-Bitmap and corresponding tags. Position 0 1 2 3 Tag Books Book Title Chapter Position 4 5 6 7 Tag Para Author Id Publisher Position 8 9 - - Tag Name Email - - Name 0000000010 OR Email 0000000001 ----------------------------------------------------- 0000000011 OR Publisher 0000000100 ----------------------------------------------------- 0000000111 (b) Title(2,4) (0010000000) Chapter (5,10) Author(11,16) Publisher(17,24) Book(1,25) (0111111111) Para(7,9) (0000100000) Id(12,14) (0000001000) Name(18,20) (0000000010) Email(21,23) (0000000001) (0000000111)(0000011000)(0001100000) OR operator OR operator Level2 Level3 Level4 (a) Level1 ... Books(0,251) (1111111111) OR operator Figure 8: Combing T-Bitmaps using OR operators. 4.2.3 Building index Here, we use labeling codes to build two kinds of index trees; i.e., tag index trees and value index trees. To illustrate the tag index construction, we extend the storage model in Figure 7. As shown in Figure 9, we can see a lot of nodes with the same tag names but with the different labeling codes; e.g., node “Email(21, 23)” and “Email(46, 48)”. We can build the index tree of each tag using the start values in labeling codes as keys and the well-known B+-tree algorithm, as shown in Figure 10 where the pointers of a leaf node in the tag index tree indicate the positions of corresponding nodes in the materialized view. For the query with an XPath: “Book//Email”, when processing the current node “Book(26, 50)”, we can use the tag index tree of “Email” to locate each leaf node by following the dotted path, and find out node “Email(46, 48)” covered by node “Book(26, 50)”. Figure 9: Extended storage model. Figure 10: Tag index tree. Besides, we can also build a value index tree according to the text values of nodes in the document, as shown in Figure 11. The construction method is the same as that used to build tag index trees. However, we generate only one value index tree for each materialized view, and the records of a leaf node are with the [text, start, pointer] format where the pointers also indicate the positions of corresponding nodes in the materialized view. 4.3 Query processing In this section, the query transformation based on a materialized view is introduced first. Then, according to different axes specified in the transformed query, we make use of the T-Bitmap and indexes built in the view processing to accelerate query processing. Finally, we also introduce subsequent processing for a query specifying the particular predicate in an Xpath expression. John,15 Introduction to XML,6 Q345,13 The One,33 pA854,38 pHow to know XML,3 pIntroduction to XML,6 pJimmy@hpdiy.zzn.com,47 pJimmy,40 pJohn,15 pNetwork,44 pJohn@hpdiy.zzn.com,22 pQ345,13 pSmall World,28 pQ&A,31 pThe One,33 PYour First XML,8 pXML,19 Figure 11: Value index tree. 310 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 4.3.1 Query pre-processing After materialized views are created, a query should be transformed based on its corresponding materialized view. For the query as shown in Figure 4, its Xpath expression (traversing from node root) is transformed into a new one (traversing from node “Books”) as shown in Figure 12. Figure 12: Query transformation based on a materialized view. Before utilizing T-Bitmap and indexes to accelerate query processing, we must deal with parent-child axes and/or ancestor-descendant axes specified in the transformed query. We execute navigation or indexing according to different axes, and recursively check nodes in the document tree. 4.3.2 Navigation For a parent-child axis, we use a navigation way to search nodes in the document tree. Here, T-Bitmap can be used to avoid unnecessary search during the navigation, since it provides the information whether result nodes are beneath the current processing node. For a query “/R/A/C” as shown in Figure 13, we can use an AND operator to determine whether node C is beneath the current processing node. First, for current node R, Query(11010)^R(11111)=Query(11010) indicates node C is beneath node R. Then, for the next node A1, Query(01010)^A1(01100)Query(01010) indicates node C cannot be beneath node A1, and we do not need to search the sub-tree rooted at node A1. Next, for node A2, Query(01010)^A2(01110)=Query(01010) indicates node C is beneath node A2. Finally, we recursively check nodes in the document tree until node C is found. 4.3.3 Indexing For a query “/A//B∙ ∙ ∙” specifying an ancestor- descendant axis as shown in Figure 14, although we can also use T-Bitmap to search node B, the search based on a parent-child axis would go through a lot of unnecessary intermediate nodes. Therefore, we use indexes to directly search a descendant node, instead of using T-Bitmap. As mentioned in Section 4.2.3, we use the start value in the labeling code of the ancestor as the key to search the tag index tree of the descendant. Then, we check whether the descendant node is covered by the ancestor node; if yes, we fetch the descendant node and proceed to parse the query downward. 4.3.4 Subsequent processing In this section, we investigate the processing for the query specifying a particular predicate. One is the query specifying values, and another is the twig query. 4.3.4.1 Query specifying values For a query “/Books/Book[Author = ‘Jimmy’]/Publisher = ‘XML tech’ ”, two values are specified for filtering nodes in the document. As shown in Figure 15, we use the value index tree as mentioned in Section 4.2.3 to find out value nodes “Jimmy(12)”, “XML tech1(7)”, and “XML tech2(15)”, and then put them into their corresponding queues, respectively. When processing node “Book” in the query, we fetch node “Book1”, and find that node “Jimmy(12)” is not covered by node “Book1”; i.e., [2,9] < 12. Then, for the next node “Book2”, although node “Book2” covers node “Jimmy(12)”, node “XML tech1(7)” is not covered by node “Book2”. Then, we choose the next value node “XML tech2(15)” and do the same inspection. Finally, we find out node “Book2” is the result node. Book1 1111 Books Book2 John XML tech1 Jimmy XML tech2 0111 0111 0010 0001 0010 0001 Author1 Author2Publisher1 Publisher2 (1,18) (2,9) (10,17) (3,5) (6,8) (11,13) (14,16) 4 7 12 15 Book Jimmy 0111 0010 0001 Author Publisher Query: /Books/Book[Author = ‘Jimmy’]/Publisher = ‘XML tech’ BookBooks PublisherAuthor T-Bitmap XML tech Jimmy XML tech2 XML tech1 V1 V2 V1 queue V2 queue Books 1111 Figure 15: Value node processing. SELECT extract(sys_nc_rowinfo$,'/Books /Book[Title="How to know XML"]/Publisher /Email=' John@hpdiy.zzn.com '') FROM mv; R A C Query 11010 01010 00010 R A3A2A1 B B C B D 11111 01100 00100 00100 00010 00100 00001 01110 01101 R A B C D T-Bitmap Figure 13: Navigation using T-Bitmap. A B ....... ...... Ancestor-descendant relationship intermedia nodes Figure 14: Unnecessary intermediate nodes. Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 311 4.3.4.2 Twig Query Besides, for a query “/Book[Publisher]/Author” as shown in Figure 16(a), we can also process it in the same way as done in Section 4.3.2. As shown in Figure 16(b), two solutions “Book, Publisher, Author1, Jim” and “Book, Publisher, Author2, Jimmy” exist in the document. They can be found by 1) merging Path1 and Path2, and 2) merging Path1 and Path3, as shown in Figure 16(c). Book Publisher Author Book Publisher Author1 Author2 Jim Jimmy PublisherBook Author T-Bitmap Book Publisher Book Author1 Book Author2 twig query /Book[Publisher]/Author (a) XML document (b) (c) Path 1 Path 2 Path 3 Figure 16: Twig query processing. 5 Experiments In this section, four experiments are conducted to show the superiority of our approach proposed in this paper. These experiments are written in Java (JDK1.7) and conducted on an Intel Pentium4 3GHz CPU with 3G main memory in Windows 7. In the first experiment, we present the comparisons among different methods. In the second experiment, we investigate the effect of query types on different methods, especially on our method. In the third experiment, we use synthesis documents to analyze our method, and try to find some characteristics. In the last experiment, we address the space vs. time issue if multiple materialized views can be integrated together to save the space; in other words, more than one query would search from a materialized view. 5.1 Comparisons among different methods In this experiment, we compare the search ways in different methods as shown in Figure 17. The first method is the original search way which is always from the root of a document tree. The second method was proposed by Godfrey et al. [1], which searches from the root of a materialized view. The last method is ours which also searches from the root of a materialized view, but with the aid of auxiliary data structures. To fairly compare with the method proposed by Godfrey et al., an XML benchmark available on the XMark site is used in the experiment, which has data size 113.794MB and 1,513,518 nodes. Also, we follow the similar query types and comparison ways used in the experiments conducted by Godfrey et al. As shown in Table 2, there are twelve different types of queries tested in the experiment. To contrast with the searching ways as shown in Figure 17, the columns as shown in Table 3 are 1) query types, 2) searching time on the original tree, 3) view creation time, 4) searching time using Godfrey et al.' method, and 5) searching time using our method. The experimental results show that our method performs much better than the Godfrey et al.' method in all the queries, especially for long-path and twig queries Q3, Q7, Q8, Q9, and Q11. This is why our method uses the T-Bitmap and index structures to accelerate query processing. 5.2 Effect of query types on different methods In this experiment, we use the same XML benchmark as the first experiment, but different queries as shown in Table 4. For the searching axis, Q1 and Q2 are based on the parent-child axis, Q3 and Q4 on the ancestor- descendant axis, and the others on mixed axes. For the query types, Q1, Q3, Q5, Q6, and Q7 are path queries, whereas the others are twig queries. Furthermore, Q1, Q3, Q4, Q5, Q6, Q7, and Q10 are with value predicates. In this experiment, as shown in Table 5, our method is still the best one among different methods in all the queries. Even taking the worst case Q8 as an example, our method is 206 times faster than the original way, and 88 times faster than the Godfrey et al.' method. The reason is, as shown in Table 6, the number of nodes visited for Q8 in our method is only 1/28 times of the original way, and is only 1/10 times of the Godfrey et al.' method. For the path queries (i.e., Q1, Q3, Q5, Q6, and Q7), both execution time of the Godfrey et al.' method and ours is less than one second. For the twig queries (i.e., Q2, Q8, and Q9), they cost more execution time than the path queries since a large amount of nodes are visited. However, for the similar twig queries (i.e., Q4 and Q10), they do not cost much execution time since only a very small amount of nodes are required to visit. In summary, the advantages of our method are 1) when dealing with twig queries, we only need to check T-Bitmap and skip an entire sub-tree if not matched, and 2) when dealing with the queries with value predicates, we can use the value index tree to achieve efficient processing. From begining to end rootOriginal search Index search Godfrey method Our method Figure 17: Search ways in different methods. 312 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 5.3 Experiments on synthesis documents In this experiment, six synthesis documents with different fanout are used to analyze our method for three different types of queries as shown in Table 7. Q1 is based on the parent-child axis, Q2 is based on the ancestor-descendant axis, and Q3 is a twig query with three predicates and based on mixed axes. As shown in Table 8, we find that the execution time of each query increases as the fanout increases. Moreover, regardless of the complexity in Q3, it still costs almost the same time as Q1 and Q2 using our method; i.e., its execution time would not increase significantly even if it is a twig query with three predicates. However, for Q3 using the original way and/or the Godfrey et al.' method, their execution time increases seriously as shown in Table 9. Especially for the Godfrey et al.' method, the execution time for fanout30 is 202 times slower than that for fanout5. 5.4 Experiments on space vs. time In this experiment, we explore the space vs. time issue when multiple materialized views are integrated together. In order to achieve the premise that more than one query can search from a materialized view, we reuse seven queries (i.e., Q3, Q4, Q5, Q6, Q7, Q8 and Q10) as shown in Table 4. Then, we find that 1) Q3, Q6, and Q7 can search from the materialized view built based on Q3, 2) Q4 and Q10 can search from the materialized view built based on Q4, and 3) Q5 and Q8 can search from the materialized view built based on Q8. Thus, after the integration, we have three materialized views for these seven queries. The data space and execution time between no integration and integration are shown in Table 10. Taking the group (Q3, Q6, Q7) as an example, the overall data space is 3+1+3=7(KB) and the total execution time is 3+2+12=17(ms) if each query has its own materialized view; however, after the integration, the data space is only 3(KB), but the total execution time increases to 3+5+26=34(ms). In order to explore the relationship between data space and execution time for these two strategies (i.e., no-integration and integration), we define two terms: 1) amount ratio for data space and 2) speed ratio for execution time as follows. Table 2: Twelve kinds of queries. Q1 /site/regions/europe/item/mailbox/mail Q2 /site//item/mailbox/mail Q3 /site//africa/item/description/parlist/listitem Q4 /site//person/profile/interest[@category] Q5 /site//person/profile[age]/interest[@category="category620"] Q6 /site//person/profile[contains(age,"18")]/education Q7 /site//open_auction[@id="open_auction5"]//date Q8 /site//category/description[text]/parlist/listitem Q9 /site//category/description[text/keyword]/parlist/listitem Q10 /site/*/*/item/mailbox/mail Q11 /site//*//africa/item/name Q12 /site//item/mailbox[count(mail)] Table 3: Comparisons among different methods. Time(ms) Query Original tree % View creation * View non- indexed # View indexed Q1 193718 30131 64980 5752 Q2 195146 32147 84574 20599 Q3 310315 31137 100130 593 Q4 185361 32890 99890 29436 Q5 193474 31147 147344 6865 Q6 191034 29702 65248 9106 Q7 203447 27112 132522 279 Q8 212511 28317 60353 232 Q9 241417 30169 64384 916 Q10 185357 31603 80727 19811 Q11 199274 28417 65059 320 Q12 209115 31731 99283 20687 % View creation: Views created for both methods * View non-indexed: Godfrey et al.' method # View indexed: Ours Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 313 ( ) ( ) strategy space strategy amount ratio space original   (1) ( ) ( ) strategy time strategy speed ratio time original   (2) where space(strategy) is the overall data space used in the no-integration or integration strategies, time(strategy) is the total execution time required in the no-integration or integration strategies, space(original) is the data space of the original document, and time(original) is the execution time on the original document. Then, we can use these two terms to judge which strategy is better for the system performance as follows. integration no integration no integration integration amount ratio speed ratio amount ratio speed ratio        (3) or ( ) ( ) ( ) ( ) space integration time no integration space no integration time integration    (4) For Equation (4), the former term represents that the data space benefits for the integration strategy can be Table 4: Different kinds of queries. Q1 Q2 /site/open_auctions/open_auction[@id="open_auction5"]/initial /site/open_auctions/open_auction[annotation/author]/bidder/date Q3 Q4 //site//open_auctions//open_auction[@id="open_auction0"]//current //person[@id="person0"][creditcard]//watch Q5 Q6 Q7 Q8 Q9 Q10 /site/regions//item[@id="item0"]//mail //open_auction[@id="open_auction0"]/bidder/date /site/open_auctions/open_auction[@id="open_auction0"]/../end /site/regions//item[//text/bold]//location //closed_auctions/closed_auction[//description/text]/seller //people/person[@id="person0"][//business]/name Table 5: Execution time. Time(ms) Query Original tree View creation View non- indexed View indexed Q1 31818 14920 193 8 Q2 68769 15731 37949 1883 Q3 29718 11787 140 3 Q4 30989 10056 103 3 Q5 27976 10705 119 3 Q6 30123 11723 48 2 Q7 32680 12135 139 12 Q8 4344137 222996 1853537 21111 Q9 1560306 246697 201444 16185 Q10 28425 10374 70 2 Table 6: Number of nodes visited. Nodes Query Original tree View non- indexed View indexed Q1 48005 57 5 Q2 259604 87331 23794 Q3 24001 64 3 Q4 31502 24 8 Q5 34124 50 3 Q6 25960 8 5 Q7 24005 64 3 Q8 1204343 432651 43502 Q9 736217 102736 39003 Q10 25647 24 5 Table 7: Three kinds of queries. Q1 /root/L1/R Q2 //root//R Q3 /root//L1[//R][//Q][//S] 314 Informatica 41 (2017) 305–315 Y.-F. Huang et al. gained (i.e., data space reduced) whereas the latter term represents that the execution time benefits for the no- integration strategy can be gained (i.e., execution time reduced). If Equation (4) with the equal weighting between space and time is true, the integration strategy should be adopted. According to the data taken from Table 10, we can calculate the former term and latter term in Equation (4), as shown in Table 11. From the statistical data, we find that the integration strategy is better for group (Q3, Q6, Q7) and group (Q4, Q10), but the no-integration strategy is better for group (Q5, Q8). Absolutely, different strategies can be adopted for different query groups at the same time to make the system performance in the best status. 6 Conclusions In this paper, we employ some processing methods on XML documents to improve document retrieval. The goal is to reduce the number of visited nodes during tree traversing, thereby leading to faster query processing. To achieve this goal, we focus on the usage of database views. First, we build a materialized view from an original document for each query. Then, on each materialized view, we also build auxiliary structures such as T-Bitmap and indexes to further accelerate query processing. According to different axes specified in an Xpath expression, we have different techniques to handle them. Finally, through the experiments, we 1) compare the performances among different methods, 2) investigate the effect of query types on them, 3) use Table 8: Comparisons for different fanout in our method. Time(ms) Q1 Q2 Q3 Fanout5 358 326 392 Fanout10 571 554 569 Fanout15 1065 969 1023 Fanout20 1514 1432 1486 Fanout25 2164 2023 2137 Fanout30 2701 2579 2658 Table 9: Comparisons for different fanout in different methods. Q3 Time(ms) Original tree View non- indexed View indexed Fanout5 143618 419 392 Fanout10 278615 796 569 Fanout15 436672 2700 1023 Fanout20 655059 10815 1486 Fanout25 678379 38339 2137 Fanout30 879231 84779 2658 Table 10: Comparisons between no integration and integration. No integration Space(KB) Time(ms) Q3 3 3 Q4 1 3 Q5 2 3 Q6 1 2 Q7 3 12 Q8 56226 21111 Q10 1 2 Integration Space(KB) Time(ms) (Q3, Q6, Q7) 3 3+5+26 (Q4, Q10) 1 3+2 (Q5, Q8) 56226 16491+21111 Table 11: Comparisons based on Equation (4). (Q3, Q6, Q7) (Q4, Q10) (Q5, Q8) Former term 3/7=0.43 1/2=0.5 56226/56228=1 Latter term 17/34=0.5 5/5=1 21114/37602=0.56 Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 315 synthesis documents to analyze our method, and 4) address the space vs. time issue if materialized views are integrated together. References [1] Godfrey P, Gryz J, Hoppe A, et al. Query rewrites with views for XML in DB2. In: Ioannidis Y, Lee D, Ng R, eds. Proceedings of the IEEE 25th International Conference on Data Engineering, Shanghai, China, 2009. 1339-1350 [2] Murthy R, Banerjee S. XML schemas in Oracle XML DB. In: VLDB Endowment, eds. Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, 2003. 1009-1018 [3] Halevy Y. Answering queries using views: a survey. Very Large Data Bases Journal, 2001, 10: 270-294 [4] Jayavel S, Jerry K, Eugene S, et al. Querying XML views of relational data. In: VLDB Endowment, eds. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, 2001. 261-270 [5] Katsifodimos A, Manolescu I, Vassalos V. Materialized view selection for XQuery workloads. In: Fuxman A, eds. Proceedings of ACM SIGMOD International Conference on Management of Data, Scottsdale, Arizona, USA, 2012. 565-576 [6] Roantree M, Liu J. A heuristic approach to selecting views for materialization. Software: Practice and Experience, 2014, 44: 1157-1179 [7] Bonifati A, Goodfellow M, Manolescu I, et al. Algebraic incremental maintenance of XML views. ACM Transactions on Database Systems, 2013, 38: 14:1-14:45 [8] Wu X, Theodoratos D, Wang W H, et al. Optimizing XML queries: bitmapped materialized views vs. indexes. Information Systems, 2013, 38: 863-884 [9] Wu X, Theodoratos D, Kementsietsidis A. Configuring bitmap materialized views for optimizing XML queries. World Wide Web, 2015, 18: 607-632 [10] Gosain A, Sabharwal S, Gupta R. Architecture based materialized view evolution: a review. Procedia Computer Science, 2015, 48:256-262 [11] Gosain A, Sachdeva K. A systematic review on materialized view selection. In: Satapathy S et al., eds. Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, Vol. 515. Springer, Singapore, 2017. 663-671 [12] Bruno N, Koudas N, Srivastava D. Holistic twig joins: optimal XML pattern matching. In: Franklin M J, eds. Proceedings of ACM SIGMOD International Conference on Management of Data, Madison, WI, USA, 2002. 310-321 [13] Jiang H, Wang W, Lu H, et al. Holistic twig joins on indexed XML documents. In: VLDB Endowment, eds. Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, 2003. 273-284 [14] Huang Y F, Wang S H. An efficient XML processing based on combining T-Bitmap and index techniques. In: Biaz S, Bellaachia A, eds. Proceedings of the IEEE International Symposium on Computers and Communication, Marrakech, Morocco, 2008. 858-863 [15] Hsu W C, Liao I E, Wu S Y, et al. An efficient XML indexing method based on path clustering. In: Alhajj R S, eds. Proceedings of the 20th IASTED International Conference on Modeling and Simulation, Banff, Alberta, Canada, 2009. 339-344 [16] Karthiga D, Gunasekaran S. Optimization of query processing in XML document using TAR and path based indexing. International Journal of Computer Science and Network Security, 2013, 13: 119-127 [17] Alghamdi N S, Rahayu W, Pardede E. Object-based semantic partitioning for XML twig query optimization. In: Barolli L et al., eds. Proceedings of the IEEE 27th International Conference on Advanced Information Networking and Applications, Barcelona, Spain, 2013. 846-853 [18] Thi Le D X, Maghaydah M, Orgun M A, et al. Optimization of XML queries by using semantics in XML schemas and the document structure. In: Lin X et al., eds. Proceedings of the 14th International Conference on Web Information Systems Engineering, Nanjing, China, 2013. 343-353 [19] Ordonez C. Optimization of linear recursive queries in SQL. IEEE Transactions on Knowledge and Data Engineering, 2010, 22: 264-277 [20] Subramaniam S, Haw S C. ME labeling: a robust hybrid scheme for dynamic update in XML databases. In: Ismail M, Ramli N, eds. Proceedings of the IEEE 2nd International Symposium on Telecommunication Technologies, Langkawi, Malaysia, 2014. 126-131 [21] Belgamwar H C, Dhore S M, Rathod P U, Deshmukh S S, Nandanwar G S. Review on storing and indexing XML documents upside down. International Journal for Engineering Applications and Technology, 2015, Manthan-15 [22] Ferro N, Silvello G. Descendants, ancestors, children and parent: a set-based approach to efficiently address XPath primitives. Information Processing & Management, 2016, 52:399-429 [23] Tudor N L. Query optimization against XML data. Studies in Informatics and Control, 2016, 25:173- 180 [24] O’Connor M F, Roantree M. Desirable properties for XML update mechanisms. In: Proceedings of the 2010 EDBT/ICDT Workshops, Lausanne, Switzerland, 2010 [25] Lu J, Ling T W, Chan C Y, et al. From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In: VLDB Endowment, eds. Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, 2005. 193-204