Optimizing the execution of XSLT stylesheets for querying transformed XML data

Abstract

We have to deal with different data formats whenever data formats evolve or data must be integrated from heterogeneous systems. These data when implemented in XML for data exchange cannot be shared freely among applications without data transformation. A common approach to solve this problem is to convert the entire XML data from their source format to the applications' target formats using the transformations rules specified in XSLT stylesheets. However, in many cases, not all XML data are required to be transformed except for a smaller part described by a user's query (application). In this paper, we present an approach that optimizes the execution time of an XSLT stylesheet for answering a given XPath query by modifying the XSLT stylesheet in such a way that it would (a) capture only the parts in the XML data that are relevant to the query and (b) process only those XSLT instructions that are relevant to the query. We prove the correctness of our optimization approach, analyze its complexity and present experimental results. The experimental results show that our approach performs the best in terms of execution time, especially when many cost-intensive XSLT instructions can be excluded in the XSLT stylesheet. [PUBLICATION ABSTRACT]

Full text

Translate

Turn on search term navigation

Knowl Inf Syst (2009) 18:331391

DOI 10.1007/s10115-008-0144-4

REGULAR PAPER

Sven Groppe Jinghua Groppe Stefan Bttcher

Thomas Wycisk Le Gruenwald

Received: 13 October 2006 / Revised: 3 April 2008 / Accepted: 25 April 2008 / Published online: 11 June 2008 Springer-Verlag London Limited 2008

Abstract We have to deal with different data formats whenever data formats evolve or data must be integrated from heterogeneous systems. These data when implemented in XML for data exchange cannot be shared freely among applications without data transformation. A common approach to solve this problem is to convert the entire XML data from their source format to the applications target formats using the transformations rules specied in XSLT stylesheets. However, in many cases, not all XML data are required to be transformed except for a smaller part described by a users query (application). In this paper, we present an approach that optimizes the execution time of an XSLT stylesheet for answering a given XPath query by modifying the XSLT stylesheet in such a way that it would (a) capture only the parts in the XML data that are relevant to the query and (b) process only those XSLT instructions that are relevant to the query. We prove the correctness of our optimization approach, analyze its complexity and present experimental results. The experimental results show that our approach performs the best in terms of execution time, especially when many cost-intensive XSLT instructions can be excluded in the XSLT stylesheet.

Keywords XML XSLT Query reformulation Optimization Transformation Views

S. Groppe (B) J. Groppe

University of Lbeck, Institute of Information Systems (IFIS), Ratzeburger Allee 160, 23538 Lbeck, Germanye-mail: [email protected]

J. Groppee-mail: [email protected]

S. Bttcher T. Wycisk

University of Paderborn, Faculty 5, Frstenallee 11, 33102 Paderborn, Germany e-mail: [email protected]

T. Wyciske-mail: [email protected]

L. Gruenwald

University of Oklahoma, School of Computer Science, Norman, OK 73019, USA e-mail: [email protected]

Optimizing the execution of XSLT stylesheets for querying transformed XML data

123

332 S. Groppe et al.

1 Introduction

XML has increasingly become a popular tool for representing information to be exchanged among organizations. Since no uniform formats for different types of information exist, applications usually dene their own XML formats to store their data. A problem arises in that XML data with different formats cannot be shared and exchanged freely without some types of data transformation. A satellite technology of XML, called Extensible Stylesheet Language for Transformation (XSLT) [44], has been developed to implement the task of transformation of XML data. XSLT embeds the XPath language [45] for the purpose of addressing XML nodes of an XML document. XPath can also serve as an independent query language for retrieving specied data from XML documents.

When an application wants to access a subset of XML data with a format different than its own format, the common approach that it can follow is to transform the entire XML document from the source format into the target format. Figure 1 describes the data access and transformation process. The XSLT processor transforms the source XML document Daccording to the given XSLT stylesheet S into the target XML document S(D). The needed data for the application, Q(S(D)), is obtained by applying a query Q to the transformed XML document S(D).

The common approach performs well when all the data or most of the data from the source document is needed. When the size of the queried data is small compared to the entire transformed document, the common approach performs worse compared to approaches that transform only the required parts of the XML data instead of all the XML data [24].

An optimized approach, the query transformation approach, which optimizes the execution time of answering queries on XML data that is transformed on-demand according to a given XSLT stylesheet whenever a query is set up, has been presented in [24]. Instead of transforming the entire source XML document, the query transformation approach only transforms parts of the source document that are necessary to answer a given query by transforming the query from the target format into the source format according to the XSLT stylesheet. Figure 2 depicts the architecture of the optimized approach. A given query Q is transformed by the query transformation approach into a query R formulated in the source format according to the given XSLT stylesheet S. The transformed query R is used to retrieve the sufcient data R(D) from the source document D. The retrieved data R(D) is then transformed into the target format to get S(R(D)) by applying the XSLT stylesheet S. At last, Qis applied to S(R(D)) in order to obtain the needed data Q(S(R(D))).

The query transformation approach improves the performance of the data access and transformation by projecting rst the input data to a sufcient part and then transforming only the sufcient part. However, this approach does not avoid the execution of unnecessary XSLT instructions in the XSLT stylesheet S. To ll in this gap, in this paper, we present the XSLT optimization for XPath queries approach (XSLT optimization) (see Fig. 3). Rather

Fig. 1 The common approach of XML data access and XML document transformation

XSLT Processor

XPath Evaluator

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 333

Fig. 2 Data access and transformation of the query transformation approach

Fig. 3 Data access and transformations of the XSLT optimization approach

than transforming a given query from the target format into the source format according to an XSLT stylesheet, our new XSLT optimization approach reconstructs the given XSLT stylesheet S so that the new XSLT stylesheet S contains the rules for transforming the XML document to only the part of the XML document that is relevant to answer a given query Q. First, we compute a new XSLT stylesheet S by optimizing the given XSLT stylesheet Saccording to the given query Q. Then the XSLT processor applies the new XSLT stylesheet S to the source XML document D in order to transform D to only the part of the target data that is relevant to answer the given query Q. Finally, the given query Q is applied to retrieve the needed data Q(S(D)). We show that whenever the XSLT stylesheet S contains many of cost-intensive XSLT instructions that are not relevant to answer a given XPath query Q, the XSLT optimization approach has considerable advantages compared to both the query transformation approach and the common approach in terms of execution time. Furthermore, we avoid the additional step of retrieving the resultant XML fragment R(D), i.e., we do not have to implement the retrieval of R(D). Besides the XSLT stylesheet optimizer, we use only standard XSLT processors and XPath evaluators that are well-tested products, which are freely available or commercial products.

Note that the proposed optimizations (see Figs. 3, 2, respectively) are independent of whether S (D) (or S(R(D)), respectively) is materialized or Q(S (D)) (or Q(S(R(D))), respectively) is executed at once, and lead in both cases to similar experimental results. Note that the proposed optimizations can be also applied in multiple queries scenarios dealing

123

334 S. Groppe et al.

with many XPath queries Q1, , Qm by determining S (or R, respectively) for the query Q = Q1|. . .|Qm. After materializing S (D) (or S(R(D)), respectively), the result of each

single query Qi {Q1, . . ., Qm} can be determined by applying Qi(S (D)) (or Qi(S(R(D))),

respectively).

The rest of the paper is organized as follows. Section 2 introduces the basic concepts including the XPath language, the XSLT language and an XSLT processor model that is later used in order to prove the correctness of the approach. Section 3 presents the search for those output nodes in the given XSLT stylesheet S that are relevant to answer the given query. Section 4 describes how to construct the optimized XSLT stylesheet S based on the result of the search so that S computes only the part of the target data that is relevant to answer the given query. Furthermore, Sects. 3 and 4 also include a complexity analysis of time and space required by the presented algorithms. Section 5 presents the results of the experiments conducted to compare the execution times of the XSLT optimization approach,the common approach and the query transformation approach. Section 6 discusses the related work. The summary and conclusions are given in Sect. 7. Section 8 contains the Appendix, which introduces a formal XSLT output data model for a precise and formal denition of the search described in Sect. 3. Furthermore, the Appendix covers a proof of correctness of the XSLT optimization approach based on a formal XSLT processor model.

2 Basic concepts

The main underlying technologies involved in our work are the XPath language and the XSLT language. Sections 2.1 and 2.2 present the essentials of XPath and XSLT, respectively, and Sect. 2.3 illustrates the common approach and the ideas of our approach with examples.

2.1 XPath Basics

The XML Path language (XPath) [45] is a W3C-developed language for specifying node sets in XML documents. XPath can either be used as an independent query language or is embedded in other XML languages like XSLT and XQuery.

The semantic of each XPath expression is dened in terms of the semantic of its subexpressions. The basic expression is called a location step a :: n[q1]. . .[qi], which consists

of an axis a and a nodetest n with or without predicates q1, . . ., qi , e.g., child::paperand child::section[child::text()="Introduction"]. XPath denes 13 axes, each of which identies a set of nodes related to the context node, which is the specied node or the document node of the XML document. For example, the child axis contains all the child XML nodes of the current input XML node, the descendant axis contains all XML nodes of the transitive closure of the child XML nodes, and the attribute axis contains all the attribute nodes of the current input XML node. A node test lters nodes from the node set identied by the axis by the node name or by the node type. XPath spec-ies seven node types. For example, the node test paper selects the nodes with the name paper, and the node test text() selects text nodes. Axis and node test of a location step select a set of XML nodes relevant to a context node, which is further ltered by the predicates. Predicates further select nodes from the set of nodes ltered by the node test. A predicate is a Boolean expression which tests every node in the context node set. If the result of a predicate is true for a given node, the node remains in the node set. For example, [child::text()="Introduction"] checks whether the content of a text node of the context node is Introduction. Location steps are separated by the token /, and

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 335

the nodes selected by a location step are the context nodes of the next location step. Note that an XPath expression with a leading forward slash / is called an absolute XPath expression, indicating that the navigation in the XML nodes starts at the document node; otherwise, it is called a relative XPath expression, indicating that the navigation starts from a context node, which is the current input node of the XML document. Figure 4 contains an example of an XPath expression Q.

The XPath language also denes several abbreviations, e.g., /child::a is abbreviated to /a, attribute::a to @a and // represents /descendant-or-self::node()/, which can be easily transformed into an equivalent long form. We use the more general long form in this paper in order to consider all cases of XPath.

The XPath expression in Fig. 4 contains four location steps and selects the pages attribute of section elements, which have a text node with value Introduction, the parent of which is a tableOfContents element, which is a child node of the root element node paper.

2.2 XSLT basics

The Extensible Stylesheet Language for Transformation (XSLT), a part of the Extensible Stylesheet Language (XSL) [44], is another language developed by W3C. It is used to specify rules for transforming XML documents from one format to another format. XSLT stylesheets themselves are XML documents and dene their own elements typically with the namespace prex xsl, e.g., the root element of the XSLT stylesheet in Fig. 6 is <xsl:stylesheet>in(S1). XSLT embeds XPath to select nodes for the transformation of an XML document and a subset of XPath for the test of whether a pattern matches the context node.

An XSLT stylesheet consists of a set of transformation rules, called templates, each of which starts with <xsl:template>. Figure 5 contains an example input XML document for the XSLT stylesheet in Fig. 6, the result of which is the XML document presented in Fig. 7. The XSLT stylesheet of Fig. 6 contains an XSLT stylesheet with two template rules which start in node (S2) and in node (S10). The elements which are contained in templates are called XSLT instructions. An XSLT instruction describes how to transform the input XML document. For example, in Fig. 6, (S3) outputs a new element called paper, (S4)outputs an attribute called title, (S5) copies the value of the node of the input XML

Fig. 4 Example of an XPath expression

Fig. 5 Example of an input XML document D

123

336 S. Groppe et al.

Fig. 6 Example of an XSLT stylesheet S

Fig. 7 Output XML document, which is the result of S(D), where S is the XSLT stylesheet in Fig. 6 and Dis the input XML document in Fig. 5

document identied by the XPath expression self::node/attribute::label. The XSLT language supports also XSLT instructions for invoking template rules, e.g., (S9). Hereby, the XSLT processor iterates through the nodes specied in the select attribute of (S9) and invokes the templates with the context node assigned with the current node of the iteration step. A template rule is called when the pattern specied in its match attribute (see (S2)) is fullled for the context node. When more than one template rule matches a given node, the import precedence of imported templates together with the priority of the templates determines the template to invoke. We refer to [44] for a complete list of XSLT instructions, their semantics and the principles of template invocation.

The initial template rule to be invoked is the one that matches the document node / of the input document. If no such a template explicitly exists in the given XSLT stylesheet, the built-in template rule in Fig. 8 (see also [44]) is used as the initial template rule.

Fig. 8 The built-in template, which is the initial template in our example

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 337

Fig. 9 Optimized XSLT stylesheet S after applying the XSLT optimization approach to the XSLT stylesheet S in Fig. 6 and the XPath query in Fig. 4

2.3 Examples of the common approach and of our approach

In this section, to illustrate how each of the two approaches works, we use the same input XML document shown in Fig. 5.

2.3.1 Common approach

The common approach does not perform any query-specic optimization on a given XSLT stylesheet or the input XML document. Instead, the common approach consists of the following two steps. First, the input XML document in Fig. 5 is transformed according to the XSLT stylesheet in Fig. 6 into the XML document in Fig. 7.

Second, the XML document in Fig. 7 is used by another application, which executes a query on this output XML document. For example, when the query Q in Fig. 4 is applied to the output XML document, the second step computes the result, which is 1.

2.3.2 The XSLT optimization approach

In comparison, our approach applies the XPath query Q in Fig. 4 to the XSLT stylesheet in Fig. 6, and generates the optimized XSLT stylesheet in Fig. 9.

When we apply the optimized XSLT stylesheet to the input XML document in Fig. 5, only the parts of the XML document that are relevant to answer the query Q in Fig. 4 are transformed into the output XML document in Fig. 10.

We retrieve the same result 1 when we apply the query Q in Fig. 4 to the XML documents in Figs. 10 and 7.

123

338 S. Groppe et al.

Fig. 10 Output XML document, which is the result of S(D), where S is the optimized XSLT stylesheet of Fig. 9 and D is the input XML document in Fig. 5

3 Computing relevant output nodes from an XSLT stylesheet

In order to transform an XSLT stylesheet according to an XPath query, our approach of the XSLT stylesheet optimization rst searches the paths in the XSLT stylesheet that output elements, attributes and attribute values in the correct order for answering the given XPath query. In Sect. 3.1, we present our search approach which is based on the standard XPath evaluator typically constructed to search in an XML document instead of an XSLT stylesheet.

3.1 Modifying a standard XPath evaluator

In the common approach, an XPath query Q is answered from the output XML document computed by an XSLT processor from the output nodes of an XSLT stylesheet S.

Denition 1 The Output nodes of an XSLT stylesheet S are those XSLT nodes of S, which

generate an element E by the XSLT node <xsl:element name=E>, or generate an attribute A by the XSLT node <xsl:attribute name=A>, or generate text nodes containing the value "content" by the XSLT instructions

<xsl:text>content</xsl:text>, or

copy the content of XML nodes I of the input XML document by the XSLT node

<xsl:value-of select=I/>, or

copy whole element nodes of the input XML document by the XSLT nodes <xsl:copy>

and <xsl:copy-of select=I/>, or

generate a number by the XSLT node <xsl:number>.

For example, the paper element (R1) in Fig. 7 is generated by the node (S3) in Fig. 6. The tableOfContents element (R3) is generated by node (S8). All the sectionelements (R4), (R5), (R6) and (R7) are generated by node (S11). These output nodes (S3), (S8) and (S11) of the XSLT stylesheet S are reached after a sequence of nodes of the XSLT stylesheet S in Fig. 6 (and the initial built-in template in Fig. 8) are executed (e.g., <(S1), (B1), (B2), (S2), (S3), (S8), (S9), (S10), (S11) >). This sequence <(S1), (B1), (B2), (S2), (S3), (S8), (S9), (S10), (S11) > generates an output which is, for example, relevant to answer the XPath query /child::paper/child::tableOfContent/child::section.

We dene the following terms for the search on output nodes of the XSLT stylesheet.

Denition 2 Given an XPath expression XP and an input XML document D, we say that an XML node K of D is successfully visited by an XPath evaluator if the following holds: there exist two token sequences, XP1 and XP2 of XP, so that the concatenation of these token sequences, XP1 and XP2, is equal to XP, i.e., XP = XP1 o XP2. Furthermore, the XPath evaluator must retrieve a node set that contains K if the XPath evaluator parses and executes XP1 on D. We set the current context node of the XPath evaluator to K, and then further parsing and processing of XP2 must return a non-empty result. In this situation, we also say that K is relevant to answer the query XP.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 339

Example 1 Let XP1 = /child::paper/child::tableOfContent/child::section and XP2 = [child::text() = Introduction]/attribute::pages. The XPath evaluator retrieves the XML nodes (R4), (R5), (R6) and (R7) in Fig. 7 for the evaluation of XP1 on the XML document S(D) in Fig. 6; however, the XPath evaluator retrieves only a non-empty result for the evaluation of XP2 in the context of (R4). Therefore, (R4) is successfully visited by an XPath evaluator for XP = XP1 o XP2 =/child::paper/child::tableOfContents/child::section [child::text()=Introduction]/attribute::pages.

Denition 3 Let D be an input XML document. Let XN be an XSLT node of a given XSLT stylesheet S so that an execution of XN generates an output in the form of an XML node Owhile the XSLT processor transforms D to S(D). XN is relevant to answer the query Q, if Ois relevant to answer the query Q on S(D).

Example 2 (R4) in Fig. 7 is successfully visited while evaluating the XPath query in Fig. 4 (see Example 1). As (R4) is generated by the XSLT node (S11) in Fig. 6, (S11) is relevant to answer the query Q.

Proposition 1 Any XPath expression I containing path expressions can be divided into a relative part rp(I) and an absolute part ap(I) (both of which may be empty) in such a way that rp(I) contains only relative path expressions, ap(I) contains only absolute path expressions, and the union of ap(I) and rp(I), denoted as ap(I)|rp(I), is equivalent to I. This means that the evaluation of I returns the same node set as the evaluation of ap(I)|rp(I) for all XML documents and for all context nodes in the current XML document.

Proof of Proposition 1 If there are disjunctions in I, then we can factor out these disjunctions, i.e., the resultant equivalent XPath expression is of the form I1 |. . .|In. Then

rp(I) = Ir1 |. . .|Irm and ap(I) = Ia1|. . .|Iak, where Ir1, . . ., Irm for r1, . . ., rm {1, . . ., n} are all relative XPath expressions of I1, . . ., In, and Ia1, . . ., Iak for a1, . . ., ak {1, . . ., n} are all absolute XPath expressions of I1, . . ., In. It is obvious that ap(I)|rp(I)

is equivalent to I, ap(I) contains only absolute XPath expressions and rp(I) contains only relative XPath expressions. Example 3 The relative part of I = (/child::resource[attribute::type=paper] | child::contains/child::resource[attribute::type=section])/attribute::writer, which is equivalent with /child::resource [attribute::type=paper]/attribute::writer | child::contains/child::resource[attribute::type=section]/attribute::writer, is rp(I) = child::contains/child::resource[attribute::type=section]/attribute::writer, and the absolute part is ap(I) =/child:: resource[attribute::type=paper]/attribute::writer.

The successor XSLT nodes of an XSLT node N1 are those XSLT nodes N2, which are XSLT nodes executed in context of N1, i.e., which are recursively executed from an XSLT processor while executing N1.

Proposition 2 An XSLT node N2 is the successor XSLT node of an XSLT node N1 if

N2 is a child node of N1 in the XSLT stylesheet, or N1 is an XSLT node with an attribute xsl:use-attribute-sets=N and N2 is an

XSLT node <xsl:attribute-set name=N>, or

123

340 S. Groppe et al.

N1 is an XSLT node <xsl:call-template name=N> and N2 is an XSLT node

<xsl:template name=N>, or

N1 is <xsl:apply-templates select=I/> and N2 is <xsl:template

match=M> and the template of N2 is called for at least one node of the selected node set, which is the result of processing I.

Proof of Proposition 2 We can conclude Proposition 2 from the semantics of the XSLT instructions described in [44].

If we determine the successor XSLT nodes independently of the current input XML document, we can determine a superset of the successor XSLT nodes N2= <xsl:templatematch=M> of an XSLT node N1=<xsl:apply-templates select=I/> by using a tester T that checks whether N2 could be called from N1. In the following, we show the existence of an incomplete tester T by using an incomplete intersection tester for XPath expressions. N2 is called from N1 if the selected node set I in the context of the current input XML node K of N1, I(K), and M are not disjointed, i.e., I(K) M = {} and the templates

with a higher priority than N2 do not consume all XML nodes of I(K) M, which can be

tested by (I(K) M) I(K) M (M1 Mk) = {}.

For technical reasons, the tester T neglects templates with a higher priority as current testers only support subsets of XPath excluding the - operator. There exist fast (but incomplete) testers (e.g., the testers presented in [5] and [6]) for the logical intersection test of XPath expressions. The testers are incomplete but provide one-sided correctness so that if the tester returns disjointed then we are sure that the XPath expressions are disjointed; otherwise, the tester returns maybe not disjointed.

Without a complex static program analysis we consider all possible current XML nodes K, which are described by the XPath expression

XPG := /descendant or self :: node()

(/self::node()|/attribute::node()|/namespace::node()),for I(K) by applying rp(I) on the resultant XML nodes of XPG. Note that the absolute XPath expressions ap(I) are independent from current XML nodes. Thus, we determine a superset of I(K) independent of the current XML node K by

MI := ap(I)|XPG/rp(I)

and we useMM := ap(M)|XPG/rp(M)

instead of M. The tester T can be implemented by returning not called if the result of the logical intersection tester with the input MI and MM is disjointed, and otherwise maybe called (which can also be returned in the case that the intersection tester does not support the used XPath expressions).

Example 4 Table 1 contains all successor XSLT nodes of the XSLT stylesheet S of the example XSLT stylesheet in Fig. 6.

Our approach modies a typical XPath evaluator in order to search for all relevant output nodes of an XSLT stylesheet, which can generate the output that is relevant to answer the given query Q in the correctly nested order. Since a typical XPath evaluator is constructed to compute an XPath query on an input XML document rather than on an XSLT stylesheet, our approach must alter the typical XPath evaluator to search among the output nodes of an XSLT stylesheet in the following way:

Similar to the search by a standard XPath evaluator, our approach starts the search at the

root node of the XSLT stylesheet document,

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 341

Table 1 Successor XSLT nodesof the XSLT stylesheet in Fig. 6 XSLT node Successor XSLT nodes

(S1) (S2), (S10)

(S2) (S3)

(S3) (S4), (S6), (S8)

(S4) (S5)

(S5)

(S6) (S7)

(S7)

(S8) (S9)

(S9) (S10)

(S10) (S11), (S17)

(S11) (S12), (S14), (S16)

(S12) (S13)

(S13)

(S14) (S15)

(S15)

(S16)

(S17) (S10)

The search can pass non-output nodes as they do not generate output data, The search continues from an XSLT node N1 to an XSLT node N2 if N2 is a successor

XSLT node of N1,

The evaluation of lter expressions in Q is processed as follows:

If a lter expression contains a comparison op {=, ! =, <, <=, >, >=} of a value

V with a constant C, i.e., V op C, and the value V is generated by an XSLT node <xsl:text>const</xsl:text>, our modied XPath evaluator can decide that the comparison will always be true, or always be false, respectively, by evaluating the comparison const op C.

If a lter expression contains a comparison of value V with a constant C and the value

V is generated by an XSLT instruction <xsl:value-of select=I/>, our approach can transform the expression containing the comparison into an expression in the format of the input XML document, which will restrict the node set of the input XML document more precisely (see Sects. 4.2, 4.3).

If the output nodes that are relevant to answer an XPath expression in a lter cannot

be computed from the XSLT stylesheet, our evaluator can always evaluate this part of the lter to be false.

If the output nodes, which are relevant to answer an XPath expression XP in a lter,

are generated for every input XML document and XP is in the scope of an odd number of not operators in the lter (e.g., the path /a/b in not(not(not(/a/b))) is in the scope of an odd number (3) of not operators), our evaluator can always evaluate this part of the lter to be false.

In all other cases, our evaluator assumes that a part of the lter expression can be

evaluated to be true.

123

342 S. Groppe et al.

If a variable $v is referred to in a local input path expression by $v/XP, where XP is

an XPath expression, then we recursively search in the variable denition for that XPath expression XP, which restricts the content of $v in the local input path expression.

In comparison to the evaluation on XML documents, it may occur that during the search in

XSLT stylesheets, the modied XPath evaluator revisits a node N of the XSLT stylesheet without any progress in the processing of the queries. For example, the XPath evaluator can visit the XSLT nodes (S1), (S2), (S3), (S8), (S9), (S10), then the XSLT node (S17) and the XSLT node (S10) again in Fig. 6. We call this a loop. In this example, the loop contains the nodes (S17) and (S10). In order to avoid an innite search, we do not continue the search at the nal node N, e.g., the node (S17) in this example, when the loop is detected.

We describe the search on output nodes of an XSLT stylesheet according to an XPath query in more detail in the Appendix in Sect. 8.2.

3.2 Logging the search

Our modied XPath evaluator must log different paths in the XSLT stylesheet when searching for the relevant output nodes of the XSLT stylesheet according to a given XPath query. We use the log information for the further computation of the transformed XSLT stylesheet Sin Sect. 4. In this section, we dene the types of paths in the XSLT stylesheet for computing the relevant output nodes and explain how the modied XPath evaluator uses these types of paths in order to log the search.

Denition 4 A stylesheet path is a sequence of pointers to either the stylesheet path records (XP, SN, sp) or the stylesheet path records (h, sp1, . . ., spn), where

XP is an XPath expression, SN is a sequence of pair nodes in an XSLT stylesheet and the direction (f for forward

and b for backward),

sp, sp1, . . ., spn are stylesheet paths (which we call the attached stylesheet paths), or

a predicate expression q, and q {true(), false(), self :: node() = C}, where C

is a literal, i.e., a number or a string, and

h is a keyword.

XP is the part of a given XPath query, which still has to be evaluated; SN is a sequence of XSLT nodes (and their direction) containing as the last node the current node of an XSLT stylesheet when processing XP. f represents either a stylesheet path list computed from a predicate q that tests the last node of SN, or the predicate expression q itself from which no stylesheet paths can be computed like true() or false(), but also including self::node()=C. h represents operators like or, and and not.

We call an attached stylesheet path, a subsequence of XSLT nodes in forward direction, which corresponds to a subsequence of XSLT nodes with backward direction, or a loop in the sequence SN a branched path.

We log the search in the following way:

We call the stylesheet path, which contains all the visited XSLT nodes of the path from

the start node <xsl:stylesheet> to the current node of the search in the visited order, the current stylesheet path. Each entry (XP,SN,sp) of the current stylesheet path contains the remaining part XP of Q, which still has to be searched for at the visited XSLT node N.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 343

We call the stylesheet paths, which begin with the start node <xsl:stylesheet> and

may generate output that is relevant to Q, detected stylesheet paths.

Each operator like or, and and not in Q leads to one corresponding entry in the current

stylesheet path.

The modied XPath evaluator needs to return from the current XSLT node N1 to an

already visited XSLT node N2 in the current stylesheet path if an XPath query contains:

a reverse axis, i.e., parent, ancestor, ancestor-or-self, preceding or

preceding-sibling axis, or

a lter expression, e.g., a/b[c=d]/e, or a built-in function, which has XPath expressions as parameters, e.g., a[count(b/c)

=d]/e.

In most cases, we then store the stylesheet paths starting at the XSLT node N2 and ending at the XSLT node N1 in the attached stylesheet paths, which are attached to the XSLT node N2 of the current stylesheet path. Although in the case that the result of a considered path is empty, a Boolean expression could be evaluated to true. For example, XML nodes of the path b/c need not be available in order to for count(b/c)=d to be evaluated to true if d can be 0. Then, we mark the attached stylesheet path starting at the XSLT node N2 and ending at the XSLT node N1 to be not considered when computing the ipe in Sect. 4.2.

Furthermore, if a variable is referenced in a local input path expression, we store the

attached stylesheet path of the required part of a variable denition.

We can transform the currently processed lter of Q into a lter in the XML format of

the input XML document (see Sect. 4). When a lter expression contains a comparison op {=, ! =, <, <=, >, >=} of a value V with a constant C and the value is copied from

the input XML document by the XSLT instruction <xsl:value-of select=I/>, we store op, C and the stylesheet paths resulting from a search for V in the attached stylesheet paths.

For each loop detected for a current XSLT node N, we generate a stylesheet path loop,

which is the current stylesheet path minus the stylesheet path of the rst visit of N. We store loop as an entry in SN because we need to know the input, which is consumed when the XSLT processor executes the nodes of loop.

We describe all technical details of the altered XPath evaluator in the Appendix in Sect. 8.2. The altered XPath evaluator is based on an XSLT Output Data Model, which we describe in the Appendix in Sect. 8.1. The altered XPath evaluator described in the Appendix in Sect. 8.2 supports all XPath axes, predicates, comparisons in predicates, the not-built-in function and the binary and, or and union operators. However, our approach can be extended to full XPath so that in case of doubt, we can consider a larger fragment in the XSLT stylesheet to generate more output than needed. The XSLT optimization approach is still correct, as in the last step the original XPath query is applied on the result of the XSLT stylesheet. This applies also to positional predicates of XPath expressions. We discuss the supported subset of XPath, which is larger than the supported subset of the altered XPath evaluator described in Sect. 8.1, of our implemented prototype in Sect. 5.1 in more detail.

Figure 11 marks six parts, XP1, . . . ,XP6, of the example XPath query in Fig. 4. Figure 12 describes the result of the output path search for the XPath query in Fig. 4, and Fig. 13 is a graphical representation of the result of the output path search in Fig. 12. In Fig. 13, we show which entries, E1, , E5, contain which sequence of XSLT nodes of the XSLT stylesheet in Fig. 6.

123

344 S. Groppe et al.

Fig. 11 Parts XP1,, XP6 of the example XPath query of Fig. 4

{< E1:=(XP1, <(S1, f) >, ),

E2:=(XP2/child::text(), <(S2, f), (S3, f) > , ),

E3:=(XP3/child::text(), <(S8, f)>, ),

E4:=(XP4/child::text(), <(S9, f), (S10, f), ((S17, f), (S10, f))n, (S11, f)>,

{(child::text()[self::node()=Introduction], <(S16, f)>, {self::node=Introduction})}), E5:=(XP6/child::text(), <(S12, f)>, )>,E6:=(child::text(), <(S13, f)>, )>}

Fig. 12 Result of the output path search for the XPath query Q in Fig. 4 with XSLT stylesheet in Fig. 6

3.3 Complexity analysis of the output path search

Let us assume that the search for output nodes is implemented analogously to that in the altered XPath evaluator described in Sect. 3.1. Let a be the number of location steps in query Q and let |S| be the number of nodes in the XSLT stylesheet S. For each stylesheet path of the result of the previous location step, we can detect at most O(a|S| |S|1i=1(|S|!/(|S| i)!)) =

O(a|S||S|!) different stylesheet paths as the result of the current location step for the following reasons: the axis of a current location step in the worst case is a preceding or a following axis. Each stylesheet path contains at most a|S| nodes, each of which can be the start node of a stylesheet path of length 1 to |S|-1 until we surely detect a loop. At most all other XSLT nodes are the successor XSLT nodes of an XSLT node. Thus, we can detect O(|S|!/(|S|-i)!) different stylesheet paths of length i and at most O( |S|1i=1(|S|!/(|S| i)!)) different stylesheet paths of length 1 to |S|-1. Furthermore,

|S|1i=1(|S|!/(|S| i)!) = |S|! |S|1i=1(1/(|S| i)!) = |S|! |S|1j=1(1/j!) < |S|! 2.

Thus, we can detect at most O((a|S||S|!)a) different stylesheet paths, each of which contains at most O(a*|S|) XSLT nodes, for Q.

Therefore, the worst case complexity of the search in terms of run time and space is O(a|S|(a|S||S|!)a).

3.3.1 Complexity analysis for the typical case

In this section, we want to examine the complexity analysis of the search in terms of run time and space for the typical case. For this purpose, we consider some properties of the 39 XSLT stylesheets of the XSLTMark benchmark [13] and some properties of the XPath queries of the XPathMark benchmark [18]. We expect that the properties of real XSLT stylesheets and real XPath queries do not differ much from the determined properties of the benchmark XSLT stylesheets and benchmark XPath queries, as the purpose of these benchmark XSLT stylesheets and benchmark XPath queries is to reect the XSLT stylesheets and XPath queries in real-world scenarios. Here, the determined properties of the benchmark XSLT stylesheets and benchmark XPath queries represent the typical case and we conclude the complexity analysis for the typical case based on these properties.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 345

Table 2 contains some properties of the XSLT stylesheets of the XSLTMark benchmark. The average number of XSLT nodes in these XSLT stylesheets is 66.5, which is much less than the number of XML nodes in typical input XML documents. Furthermore, the average number of successor XSLT nodes is 0.42, i.e., less than half of the XSLT nodes have a successor XSLT node. The average maximum number of successor XSLT nodes in an XSLT stylesheet is approximately 6, which is a hint that XSLT stylesheets are not at XML documents, but are deep XML documents. The average maximum number of XSLT nodes generating elements with the same name is approximately 3, which demonstrates that XSLT stylesheets do not often generate similar output at different places in the XSLT stylesheet. Furthermore, most XSLT stylesheets (64%) generate elements with certain names only at one place in the XSLT stylesheets. The other XSLT stylesheets often generate layout information like several html tables, where the XSLT stylesheets often generate the <tr> elements for table rows and <td> elements for table columns.

Table 3 contains some properties of the XPath queries of the XPathMark benchmark. The average number of location steps is approximately 5, from which 80% are location steps with a child axis and from which only 8% are location steps with a recursive XPath axis, i.e., a descendant, descendant-or-self, ancestor, ancestor-or-self, preceding or following axis, the evaluation of which consumes much time in the search algorithm. 85% of the location steps are location steps with a name test as node test and only 13% of the location steps contain the unspecic node tests node() or wildcard (*). We have measured the selectivities of XPathMark queries applied to an XPathMark input XML document with a le size of 10 MB. The selectivities of the XPathMark queries vary from low (many queries) to high (few queries). The average selectivity of the queries is12.27% of the input XML document.

Thus, we assume that the typical case for real scenarios is characterized as follows: each

XSLT node in the XSLT stylesheet S has only a small number of successor XSLT nodes and S does not generate any redundant output. Furthermore, we assume that the query Q species a small node set so that we only detect a small number, which is less than a constant k, of detected stylesheet paths for each location step. Therefore, the complexity of the search in terms of run time and space is O(k*a*|S|) in the typical case.

4 Optimizing XSLT stylesheets

In this section, we present how to transform a given XSLT stylesheet S according to a given XPath query Q into an optimized XSLT stylesheet S so that Q(S(D)), instead of Q(S(D)), can be executed for answering Q on the transformed XML document of an arbitrary input XML document D.

The detected stylesheet paths (and their branched paths) describe the part of the XSLT stylesheet which generates a relevant output for the given input XPath query Q. The rst idea, which we describe in Sect. 4.1, is to eliminate all other XSLT instructions in the XSLT stylesheet S except the XSLT instructions on a detected stylesheet path (or its branched paths).

Only if a whole detected stylesheet path is executed, output that is relevant to answer Q is generated. The ideas of the approaches presented in Sects. 4.24.4 are to abort the execution of S as soon as possible whenever a whole detected stylesheet path cannot be executed completely. In Sect. 4.2, we describe how to construct lter expressions that are fullled for a current context node of an XSLT node N on a detected stylesheet path if the rest of a whole detected stylesheet path starting from N can be executed. We insert these lter expressions constructed according to Sect. 4.2 into the local input path expressions, which

123

346 S. Groppe et al.

Table 2 The XSLT stylesheets of the XSLTMark benchmark [13] and some of their properties

XSLT stylesheet Total number Average number Maximum number Maximum number of of XSLTMark of XSLT of successor of successor XSLT nodes generating benchmark nodes XSLT nodes XSLT nodes elements withthe same name

alphabetize.xsl 33 0.37 4 1

attsets.xsl 52 0.44 7 1

avts.xsl 23 0.39 4 1

axis.xsl 69 0.41 6 2

backwards.xsl 22 0.37 3 1

bottles.xsl 80 0.43 7 1

brutal.xsl 157 0.45 4 19

chart.xsl 90 0.41 3 6

creation.xsl 38 0.42 5 1

current.xsl 39 0.38 4 1

dbonerow.xsl 105 0.42 7 13

dbtail.xsl 31 0.39 3 1

decoy.xsl 129 0.39 21 12

encrypt.xsl 20 0.35 3 1

find.xsl 26 0.38 4 1

functions.xsl 57 0.39 7 1

game.xsl 77 0.43 3 9

html.xsl 66 0.47 4 4

identity.xsl 19 0.37 3 1

inventory.xsl 44 0.43 3 2

metric.xsl 60 0.38 4 1

number.xsl 58 0.62 19 1

oddtemplate.xsl 50 0.44 8 1

patterns.xsl 67 0.39 9 1

prettyprint.xsl 81 0.38 4 3

priority.xsl 76 0.37 4 3

products.xsl 45 0.4 4 2

queens.xsl 240 0.43 8 1

reverser.xsl 42 0.38 3 1

stringsort.xsl 29 0.38 4 1

summarize.xsl 44 0.59 8 1

total.xsl 47 0.4 3 5

tower.xsl 123 0.44 8 1

trend.xsl 34 0.38 3 1

union.xsl 36 0.44 5 1

xpath.xsl 30 0.43 5 1

xslbench1.xsl 233 0.46 16 14

xslbench2.xsl 71 0.58 10 2

xslbench3.xsl 49 0.47 7 1 Average 66.5 0.42 6.08 3.1

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 347

Table 3 Some properties of the XPath queries of the XPathMark benchmark [18]

Query Number of Number of Number of Number of Number of Selectivity in location location location location location steps % (XPathMark steps steps with steps with steps with with wildcards or XML document child axis recursive axis name test node() node test 10 MB)

Q1 4 4 0 3 1 49.76

Q2 9 9 0 9 0 0.24

Q3 2 1 1 1 1 4.21

Q4 2 0 2 2 0 2.15

Q5 6 4 0 5 1 25.18

Q6 3 1 2 2 1 35.94

Q7 3 1 2 2 1 9.98

Q8 9 6 0 9 0 0

Q9 6 4 0 6 0 0

Q10 6 4 1 5 1 49.74

Q11 9 6 1 9 0 0

Q12 3 1 1 2 1 5.48

Q13 3 1 1 1 2 85.96

Q14 3 1 1 2 1 9.64

Q15 3 1 1 1 2 0.35

Q16 2 1 1 1 1 0

Q17 1 1 0 0 1 100

Q18 1 1 0 0 0 0

Q19 1 1 0 0 0 0

Q20 1 1 0 0 0 0

Q21 9 7 1 6 2 0.0008

Q22 8 8 0 8 0 25.18

Q23 6 6 0 6 0 4.19

Q24 4 4 0 4 0 4.66

Q25 1 1 0 1 0 0

Q26 7 5 0 7 0 0

Q27 9 6 0 9 0 0

Q28 9 5 0 9 0 0

Q29 8 6 0 8 0 0

Q30 4 4 0 4 0 0.19

Q31 4 4 0 4 0 10.85

Q32 2 1 1 0 2 49.76

Q33 2 1 1 0 2 0

Q34 2 1 1 0 2 0

Q35 2 1 1 0 2 0

Q36 5 5 0 4 1 6.60

Q37 4 4 0 4 0 0.04

Q38 7 7 0 6 1 1.37

Q39 7 7 0 6 1 0.53

Q40 7 7 0 6 1 4.13

123

348 S. Groppe et al.

Table 3 continued

Query Number Number of Number of Number of Number of Selectivity in location location location location location steps % (XPathMark steps steps with steps with steps with with wildcards or XML document child axis recursive axis name test node() node test 10 MB)

Q41 7 7 0 6 1 7.79

Q42 5 5 0 4 1 30.08

Q43 11 11 0 11 0 3.14

Q44 5 5 0 5 0 12.03

Q45 5 5 0 5 0 13.17

Q46 6 6 0 6 0 19.49

Q47 5 5 0 5 0 4.98 Average 4.85 3.89(80%) 0.41(8%) 4.13(85%) 0.64(13%) 12.27

we describe in Sect. 4.3. Furthermore, we insert test nodes that check the lter expressions constructed according to Sect. 4.2 when detected stylesheet paths with common XSLT nodes start to separate, which we describe in Sect. 4.4. After that, we present the whole optimization algorithm in pseudo code in Sect. 4.5. We present further optimization methods in Sect. 4.6. We then conduct a complexity analysis of the presented optimization methods in Sect. 4.7.

4.1 Reducing the XSLT stylesheet to the XSLT nodes of the detected stylesheet paths

We have to guarantee that evaluating the given input XPath query Q using the optimized XSLT stylesheet S yields the same result as evaluating the given input XPath query Q using the original XSLT stylesheet S for every input XML document D. This includes the case in which the execution of an XSLT instruction, <xsl:apply-templates select=I>, on a detected stylesheet path or on a branched path calls the same templates in the original and optimized XSLT stylesheets. In general, if we delete a complete template T1 starting with the XSLT node <xsl:template match=M/>, which is not in the detected stylesheet paths (and their branched paths), another template T2 in S could be called for a node N in I. If T2 is in the detected stylesheet paths (or their branched paths), then there could exist a situation in that evaluating the given input XPath query Q using S would yield more results than evaluating Q using S. Therefore, we must not delete any template head consisting of the XSLT instruction, <xsl:template match=M/>. However, we can delete anything in the template body (i.e., any child nodes of the XSLT instruction, <xsl:templatematch=M/>), which are not in the detected stylesheet paths (and their branched paths).

The rst optimizing step can be summarized as follows: eliminate all XSLT instructions in the XSLT stylesheet S except XSLT instructions in the detected stylesheet paths (and their branched paths), and except all template heads, <xsl:template match=M/>.

Example 5 Given the XPath query, /child::paper/child::tableOfContents/child::section[child::text()=Introduction]/attribute::pages, we reduce the XSLT stylesheet in Fig. 6 to the following:

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 349

<xsl:stylesheet><xsl:template match="/child::resource[attribute::type='paper']"> <xsl:element name='paper'> <xsl:element name='tableOfContents'>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']"/>

</xsl:element> </xsl:element></xsl:template>

<xsl:template

match="child::contains/child::resource[attribute::type='section']">

<xsl:element name='section'>

<xsl:attribute name='pages'>

<xsl:value-of select="attribute::size"/>

</xsl:attribute>

<xsl:value-of select="attribute::label"/>

</xsl:element>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']"/></xsl:template></xsl:stylesheet>

4.2 Filter expressions that are fullled if a whole detected stylesheet path is executed

Denition 5 For a current input node K of an XSLT node N on a detected stylesheet path, the current input node is relevant to answer the given query if the rest of a whole detected stylesheet path starting from N with the current input node K can be executed.

In this section, we describe how to construct lter expressions to lter current input nodes that are relevant to answer the given query (see Denition 5).

Denition 6 (input nodes): The input nodes of an XSLT stylesheet S with local input path expression I are

<xsl:apply-templates select=I/>, <xsl:copy>, where I=self::node(), <xsl:copy-of select=I>, where I=I/descendant-or-self::node()

(/self::node() |/attribute::node()|/namespace::node()) as <xsl:copy-of select=I> copies whole subtrees of I,

<xsl:element name="{I}">, where we deal with this XSLT node as it is on an

attached stylesheet path,

<xsl:attribute name="{I}">, where we deal with this XSLT node as it is on

an attached stylesheet path,

<xsl:processing-instruction name="{I}">, where we deal with this

XSLT node as it is on an attached stylesheet path,

<xsl:value-of select=I/> and <xsl:for-each select=I>.

Furthermore, the following XSLT nodes, which we call test nodes, can restrict the current input XML nodes of the input XML document by a Boolean test expression T.

Denition 7 (test nodes): The test nodes of an XSLT stylesheet S with a Boolean test expression T are

<xsl:if test=T>, <xsl:when test=T> and

123

350 S. Groppe et al.

<xsl:otherwise> where T is implicitly not(T1) and and not(Ti) if

there are i>0 different XSLT nodes, <xsl:when test=T1>, , <xsl:whentest=Ti>, in the same <xsl:choose> scope, or T is implicitly true() if there is no XSLT node, <xsl:when>.

In order to construct the lter expression, we have to combine all the local input path expressions of input nodes and test expressions of test nodes along the XSLT nodes of a detected stylesheet path (and its branched paths) starting at a current XSLT node N. For this purpose, we use a variable input path expression (ipe) dened as follows:

The ipe contains the whole input path expression of the detected stylesheet path from a current XSLT node to the end of the detected stylesheet path. The following holds for the node set described by ipe in context of the current input node of N. Let W be the node set containing all current input nodes while the XSLT processor processes the last XSLT node of the detected stylesheet path. The node set described by ipe in the context of the current input node of N contains all XML nodes of W.

We now design different computing steps (see column 2 of Table 4) for ipe in order to fulll the above-mentioned requirement by collecting all input path expressions along the detected stylesheet path and combining them in the correct way depending on the type of the current node and the paths attached to the current node (see column 1 of Table 4). The correctness of these computing steps can be veried by comparing them with the semantic of the XSLT language [44].

In order to compute ipe for each node along the detected stylesheet path and its branched paths, we iterate through the detected stylesheet path and for each entry p[k] of the detected stylesheet through the sequence of XSLT nodes p[k].SN.N after the initialization of ipe. Then depending on the current entry, we do the following:

construct new path expressions of the ipe, which we call ipenew. The result is constructed from the local input path expression (I) or test expression (T), respectively, of the current XSLT node and the old input path expressions of the ipe, which we call ipeold.

recursively construct and combine the ipes of the attached stylesheet paths and loops of an

XSLT node B. For this purpose, at rst, we initialize ipe with ipeinit, then recursively construct along the branched path as before and get the ipe, which we call ipepath, after the last node of the branched path. At last we construct ipenew of the node B with the branched path from ipepath and ipeold of B.

Figure 14 presents the construction of ipe, where we use the detected stylesheet path in Fig. 12, starting at the XSLT node (S11), where we initialize ipe with self::node()in order to align the example with examples presented below. Furthermore, Fig. 14 illustrates the construction of ipe starting at the XSLT node (S9), where we initialize ipe with an empty expression, and the construction of ipe starting at the XSLT node (S17), where we initialize ipe with an empty expression using the detected stylesheet path in Fig. 12. Note that the XPath language does not support the arbitrary repetition of an XPath expression A, which we denote An, but we can retrieve a fragment containing a superset of the XML nodes required by replacing An with /descendant-or-self::node() if A contains only child, descendant and descendant-or-self axes, and with the following XPath expression

/ancestor-or-self::node()/descendant-or-self::node()(/self::node()|/attribute::node()|/namespace::node())for an arbitrary A.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 351

CurrententryComputationofcurrentipeandcompletedipeExampleXSLTnodesinFig.12

ipe new=ipeold(S1),(S2),(S3),(S4),(S8)

Inputnode(exceptNof

ipe new=if(rp(I)isempty)ap(I)[/descendantorself(S9),(S13),(S16),(S17)

(/self::node()|/attribute::node()|/namespace::node())

rp(ipe old)|ap(ipeold)]elseif(rp(ipe old)isnotempty)ipeold/rp(I)|ap(I)

elserp(I)|ap(I)

TestnodewithBooleantest

ipe new=ipeold[T]-

Attachedstylesheetpathipe init=empty(S16)

ipe new=ipeold[ipepath]

Entryrepresentinganoper-

ipe init=ipeold(S16)

Attachedloops1..mipe init=empty(S10)withattachedloop((S17,f),S(10,f))

Table4ComputationrulesforipedependingonthecurrentXSLTnode

ipe new=op(ipepath1,...,ipepathm)

ipe new=(ipeold|ap(ipepath1)|...|ap(ipepathm))

(/rp(ipe path1)|...|/rp(ipepathm))

Neitherinputnode(norNof

Fig.15)nortestnodewithout

branchedpaths

Fig.15)withlocalinputpath

expressionI

atoroplikeor,and,

notandoperators=,!=,

>,>=,<and<=withm

operands

expressionT

123

352 S. Groppe et al.

Fig. 13 Graphical representation of the result of the output path search of Fig. 12. We show which entries, E1,,E5, contain which sequence of XSLT nodes

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 353

XSLT nodelocalinputpathexpressionI

(S5) self::node/attribute::label

(S7) self::node()/attribute::writer

(S9) child::contains/child::resource

[attribute::type='section']

(S13) attribute::size

(S16) attribute::label

(S17) child::contains/child::resource

[attribute::type='section']

(S1)

(S2)

(S3)

(S4)

(S6)

(S8)

(S10)

(S11)

(S12)

(S14)

(S15)

ipe 10=ipe7[ipe9=Introduction]

ipe 11=ipe10

7=ipe3(ipe6)n

ipe 12 = (child::contains/child::resource

[attribute::type='section'])n

[attribute::label=Introduction]

/attribute::size

ipe 12=ipe11/attribute::size

ipe 4=

ipe 9=attribute::label

[attribute::type='section']

Computationof

startingat XSLT node

(S17),ipe 1isinitializedwith

ipe 3=ipe2

theemptyexpression

ipe 8=

ipe 6=ipe5ipe

ipe 1=

ipe 5=child::contains

/child::resource

ipe 2=ipe1

Computationof

ipe 11 = (child::contains/child::resource

[attribute::type='section'])n

[attribute::label=Introduction]

/attribute::size

startingat XSLT node

(S9),ipe 1isinitializedwith

theemptyexpression

6=ipe2(ipe5)n

ipe 9=ipe6[ipe8=Introduction]

ipe 10=ipe9

ipe 11=ipe10/attribute::size

ipe 3=

ipe 8=attribute::label

ipe 4=child::contains

/child::resource

[attribute::type='section']

ipe 1=

ipe 2=ipe1

ipe 7=

ipe 5=ipe4ipe

Computationof

ipe 6= self::node()

[attribute::label=Introduction]

/attribute::size

startingat XSLT node

(S11),ipe 1isinitializedwith

ipe 4=ipe1[ipe3=Introduction]

ipe 5=ipe4

Sequenceof XSLT nodesin attachedstylesheetpath

Sequenceof XSLT nodes

loop

self::node()

ipe 1=self::node()

ipe 6=ipe5/attribute::size

ipe 3=attribute::label

ipe 2=

Fig. 14 Computation of ipe starting at XSLT node (S11), (S9) and (S17) using the detected stylesheet path of Fig. 12

123

354 S. Groppe et al.

Fig. 15 Algorithm computeFilter

Figure 15 contains the computeFilter algorithm, which computes the lter expression that is fullled for a current context node of an XSLT node N on a detected stylesheet path if the rest of the contained XSLT nodes of a whole detected stylesheet path starting from N can be executed. The algorithm, computeFilter, expects an XSLT node N and the detected stylesheet paths P as input and returns the computed lter expression as output. First, the subpaths of the detected stylesheet paths (or one of its branched paths) starting from N to the end of the considered detected stylesheet path (or one of its branched paths) is computed in line (6). Note that if N is on a loop within a sequence of XSLT nodes, then the resultant subpath contains the subpath of the loop from N to the loop head. We further concatenate the resultant subpath with the subpath starting from the loop head to the end of the stylesheet path to which the loop is attached. The branched path can be again a loop so that in this case, we recursively proceed as before. We compute the ipe of each of these subpaths in line(7) and line (8). Hereby, for the rst XSLT node in a subpath, we do not apply the formula for computing ipe in the case of an input node, as we add a lter (see below) in context of the local input path expression of the rst XSLT node. Considering the construction of ipe, at least one XML node of ipe must be available in order to execute the complete path. Therefore, we compute in line (9) the required lter expression to be the disjunction of all the ipes computed before. Finally, we return the computed lter expression in line (10).

4.3 Optimizing the local input path expressions in the optimized XSLT stylesheet

Considering the algorithm to retrieve the detected stylesheet paths and their branched paths, we recognize that the XSLT processor generates an output relevant for answering the query Q only if a detected stylesheet path and its branched paths are completely executed (where loops can be executed an arbitrary number of times). In other words, if the execution of a detected stylesheet path or a branched path is aborted before the detected stylesheet path or the branched path is executed completely, the output of the aborted detected stylesheet path is not relevant for answering Q (see the search algorithm in Sect. 3).

Whenever a new node set is selected by an input node of the XSLT stylesheet, a detected stylesheet path or a branched path is aborted if the following situation occurs: the XML nodes in the input XML document D, which are needed to further execute the considered detected stylesheet path or the considered branched path, respectively, are missing. Furthermore, the output generated on a detected stylesheet path or on a branched path, which does not fulll a lter in the given input XPath query Q, is not relevant to answer the given XPath query Q either.

Therefore, our second optimization idea is as follows: We add a lter F, which we compute according to the way we describe in Sect. 4.2, to each local input path expression of those XSLT instructions, which select a new node set of the input XML document D. If F is not fullled, the considered detected stylesheet path or the branched path cannot be excecuted

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 355

completely. This means the current input node is not relevant to answer the given query. In this case, the added lter F causes the XSLT processor to not process this input node that is not relevant to answer the given query and thus abort the execution at the current XSLT instruction.

Finally, we replace the local input path expression I in the considered input node N with I[F].

Example 6 Given the XPath query, /child::paper/child::tableOfContents/child:: section[child::text()=Introduction]/attribute::pages, we optimize the local input path expressions of the XSLT stylesheet in Example 5 to the following:

<xsl:stylesheet><xsl:template match="/child::resource[attribute::type='paper']"> <xsl:element name='paper'> <xsl:element name='tableOfContents'>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section'] [descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/>

</xsl:element> </xsl:element></xsl:template>

<xsl:template

match="child::contains/child::resource[attribute::type='section']">

<xsl:element name='section'>

<xsl:attribute name='pages'>

<xsl:value-of select="attribute::size"/>

</xsl:attribute>

<xsl:value-of select="attribute::label"/>

</xsl:element>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section'] [descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/></xsl:template></xsl:stylesheet>

4.4 Inserting tests at split XSLT nodes

In general, detected stylesheet paths and/or branched paths overlap at certain XSLT nodes in the optimized XSLT stylesheet that we retrieve by applying the approach described in Sect. 4.1 and, afterwards, separate. In other words, two detected stylesheet paths and/or branched paths overlap until a last common node C and continue with different XSLT nodes, SP1, , SPn, which we call split XSLT nodes in the following.

The XSLT processor shall abort the execution of a considered path at the next input node following C when the path starting with an XSLT split node SPi cannot be executed completely. In order to abort the execution of such paths as soon as possible, we insert an <xsl:if test=F> statement as a child of C and a parent of SPi, where the path starting with SPi qualies for the test F. F is generated according to the procedure given in Sect. 4.2 with an initialized ipe being set to self::node().

If SPi is an input node, which is already optimized according to Sect. 4.2, the test F would be checked twice (the rst time in the statement, <xsl:if test=F>, and the second time in the input node). Therefore, we do not insert the statement, <xsl:if test=F>, in this case.

123

356 S. Groppe et al.

Example 7 Given the XPath query, /child::paper/child::tableOfContents/child:: section[child::text()=Introduction]/attribute::pages, we optimize the XSLT stylesheet in Example 6 by inserting test nodes at split XSLT nodes as shown below:

<xsl:stylesheet><xsl:template match="/child::resource[attribute::type='paper']"> <xsl:element name='paper'>

<xsl:element name='tableOfContents'>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']

[descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/>

</xsl:element> </xsl:element></xsl:template>

<xsl:template

match="child::contains/child::resource[attribute::type='section']">

<xsl:if test="self::node()[attribute::label='Introduction']/attribute::size">

<xsl:element name='section'>

<xsl:attribute name='pages'>

<xsl:value-of select="attribute::size"/>

</xsl:attribute>

<xsl:value-of select="attribute::label"/>

</xsl:element>

</xsl:if>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']

[descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/></xsl:template></xsl:stylesheet>

4.5 Optimization algorithm in pseudo code

Figure 16 contains the entire algorithm, optimizeXSLTStylesheet, for optimizing an XSLT stylesheet according to an XPath query Q. optimizeXSLTStylesheet expects an XSLT stylesheet S and an XPath query Q as input and returns the optimized XSLT stylesheet S. In line (5), the detected stylesheet paths are computed by calling the algorithms presented in Sect. 8.2, i.e., we evaluate the given XPath query Q on the output nodes of the given XSLT stylesheet S and return the logged paths in S. Then the XSLT stylesheet S is rst copied into S (see line (6)). We consider each XSLT node N in S, delete N in line (8) if N is not on a detected stylesheet path (or its branched paths) according to the procedure in Sect. 4.1, restrict the local input path expression of N in line (9) to line (11) if N is an input node according to the approach in Sect. 4.3, and insert a test node in line (12) to line (16) if Nis a split XSLT node according to the procedure in Sect. 4.4. Finally, we return the optimized XSLT stylesheet S in line (17).

4.6 Further optimization

There is further optimization to eliminate a location step in the query and the corresponding output nodes in the XSLT stylesheet that are not relevant to answer the query.

Example 8 Instead of using the XPath query, /child::paper/child::tableOfContents/ child::section[child::text()=Introduction]/

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 357

Fig. 16 Algorithm optimizeXSLTStylesheet

attribute::pages and the XSLT stylesheet in Example 7, we can use the XPath query, /child::section[child::text()= Introduction]/attribute::pages and the following XSLT stylesheet, where the generation of the elements paperand tableOfContent is eliminated.

<xsl:stylesheet><xsl:template match="/child::resource[attribute::type='paper']">

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']

[descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/></xsl:template>

<xsl:template

match="child::contains/child::resource[attribute::type='section']">

<xsl:if test="self::node()[attribute::label='Introduction']/attribute::size">

<xsl:element name='section'>

<xsl:attribute name='pages'>

<xsl:value-of select="attribute::size"/>

</xsl:attribute>

<xsl:value-of select="attribute::label"/>

</xsl:element>

</xsl:if>

<xsl:apply-templates

select="child::contains/child::resource[attribute::type='section']

[descendant-or-self::node()[attribute::label='Introduction']

/attribute::size]"/></xsl:template></xsl:stylesheet>

In this example, we can further optimize the XSLT stylesheet by eliminating both, the element generation of section and its content, from the XSLT stylesheet S and the sectionlocation step from the query Q. However, in general, we are not sure that inserted test nodes (see Sect. 4.4) and lter expressions (see Sect. 4.3) exclude all cases in which the lter expressions of the original XPath query Q are not fullled.

We do not present these optimization rules here but refer interested readers to [21], which state some of these optimization rules for XQuery. Grinev and Pleshachkov [21] describes a different algorithm, which is rule-based. Grinev and Pleshachkov [21] describes an optimization approach for XQuery. In comparison with [21], we describe an optimization approach

123

358 S. Groppe et al.

for XSLT in this paper, such that the results from [21] cannot be transferred. Furthermore, [21] only supports child and descendant axes for these kind of optimizations, while our contribution supports all XPath axes.

4.7 Complexity analysis

Let |S| be the number of XSLT nodes in S and let a be the number of location steps in Q. The runtime complexity of the optimization presented in Sect. 4.1 consists of the runtime of the search (see Sect. 3.3), the runtime of copying the visited XSLT nodes, which is O(|S|), and the runtime of inserting lters into the input nodes of S and inserting conditional XSLT instructions into S. For both approaches presented in Sects. 4.3 and 4.4, for each of |S| XSLT nodes, we have to consider at most O(a|S|(a|S||S|!)a) XSLT nodes (see Sect. 3.3) and

O(ka|S|) XSLT nodes in the typical case (see Sect. 3.3.1) for the computation of the ipe. Thus, the overall worst case runtime complexity is O(|S|2a(a|S||S|!)a), but the overall

runtime complexity is O(ka|S|2) in the typical case. The space complexity is dominated

by the space used for the detected stylesheet paths, which is O(a|S|(a|S||S|!)a) in the

worst case and O(ka|S|) in the typical case.

5 Performance analysis

This section reports the experiments we have conducted to study in which cases which approach (common approach, query transformation approach or XSLT optimization approach) performs the best. In the different cases, we vary the size of the input XML documents (see Sect. 5.2) and the selectivity of the queries (see Sects. 5.2, 5.3). We compare the execution times of the different approaches when using the most widely used application programming interfaces for accessing XML data, DOM [43] and SAX [31] (see Sects. 5.2,5.3). We have used synthetic data (see Sects. 5.2, 5.4) and real data (see Sect. 5.3). Section 5.1 describes the experimental environment, which we have used for the rst two experiments. We present the third experiment (see Sect. 5.4) in order to show that our XSLT optimization approach speeds up the processing of XSLT stylesheets even if we use another computer conguration and various different XSLT stylesheets. We discuss and analyze the experimental results in Sect. 5.5.

5.1 Experimental environment

Our test system for the rst and second experiments is an Intel Pentium 4 processor 1.7GHz with 1GB RAM, Windows XP as operating system and Java VM Version 1.4.2. We use Xerces2 Java parser 2.5.0 release [4] as XML parser, Xalan-Java Version 2.5.1 [3] and Saxon Version 8.0 [27] as XSLT processors.

The prototype of our XSLT optimization approach supports a larger subset of XPath than we have described in the altered XPath evaluator in Sect. 8.2. The supported subset of XPath includes all XPath axes, all built-in functions, and, or and union operators and arbitrary comparsions (<,<=, >, >= and =).

5.2 The rst experiment set using synthetic data

Figure 17 contains the XSLT stylesheet which we have used in this experiment set.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 359

Fig. 17 XSLT stylesheet used for the rst experiment set

We have generated the XML documents of different sizes according to the DTD in Fig. 18 (from 0MB up to 25MB), where the size of an XML document (or of a query result) is the number of bytes used for the textual representation of the XML document (or of the query results). The b attribute of the a tag contains a numeric value equally distributed in the interval of [0; 99].

For this experiment set, we use the query Q=/child::s/child::c[attribute::d<X]/attribute::d, which is applied to the transformation of the original document. We dene the selectivity of the query to be the size of its result divided by the size of the original document. Given the query above, our XSLT optimization approach optimizes the XSLT stylesheet of Fig. 17 to the XSLT stylesheet of Fig. 19. We use this optimized XSLT stylesheet in the following experiments. We run each experiment with both the XSLT processors, Xalan and Saxon. The average of ten experiments is presented.

The following experimental results compare our XSLT optimization approach with the common approach of transforming the entire XML document, and with the query transformation approach, and present the effects of query selectivity and size of XML data on the execution times of the three approaches. The two most widely used application programming interfaces to access XML data are DOM and SAX. Section 5.2.1 presents the experimental results when using a DOM tree as input and output; Sect. 5.2.2 presents the results when using SAX events as input and output.

Fig. 18 DTD used experiments

Fig. 19 Optimized XSLT stylesheet of Fig. 17

123

360 S. Groppe et al.

5.2.1 Using a DOM tree as input and output

First the XML document is parsed by a DOM parser and its DOM tree is generated. The second step is transforming XML data and the process of transformation depends on the individual approach:

Common approach: The entire DOM tree is the input of the XSLT processor which

transforms the DOM tree according to the XSLT stylesheet in Fig. 17 (see the lines representing the execution time of Q(S(D)) in Figs. 20, 21, 23, 24).

Query transformation: The query Q is reformulated to R according to the XSLT style-sheet in Fig. 17. In our experiments, the reformulation of Q to R needs less than 20ms. Then the relevant XML nodes are copied according to R and only the copied resultant

Fig. 20 Using Xalan with DOM and the query transformation approach

Fig. 21 Using Xalan with DOM and the XSLT optimization approach

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 361

Fig. 22 At which le size and which selectivity is our XSLT optimization approach faster than the query transformation approach when using Xalan with DOM?

Fig. 23 Using Saxon with DOM and the query transformation approach

XML fragment is the input of the XSLT processor (see [24]). The XSLT processor executes the XSLT stylesheet in Fig. 17 (see Fig. 20 for using the Xalan XSLT processor and Fig. 23 for using the Saxon XSLT processor). In these experiments, the operating system starts swapping main memory after 7MB, which disturbs the experiments. An out-of-memory error occurs at 9MB when 768MB main memory is reserved for the Java virtual machine. In comparison with transforming the entire XML document(i.e., the common approach), the query transformation approach is faster for queries with selectivity less than or equal to 30% when using the Xalan XSLT processor (see Fig. 20), and is faster for queries with selectivity less than or equal to 20% when using the Saxon XSLT processor (see Fig. 23).

123

362 S. Groppe et al.

Fig. 24 Using Saxon with DOM and the XSLT optimization approach

Fig. 25 At which le size and which selectivity is our XSLT optimization approach faster than the query transformation approach when using Saxon with DOM?

XSLT optimization: First, the XSLT stylesheet of Fig. 17 is optimized to the XSLT

stylesteet in Fig. 19. In our experiments, the optimization of the XSLT stylesheet needs less than 30ms. The entire DOM tree is the input of the XSLT processor, which processes the optimized XSLT stylesheet of Fig. 19 (see Fig. 21 for using the Xalan XSLT processor and Fig. 24 for using the Saxon XSLT processor). There does not exist an out-of-memoryerror at le sizes up to 10MB. In comparison with the common approach, our XSLT optimization approach is faster for queries with selectivity less than 80% when using the Xalan XSLT processor (see Fig. 21) and when using the Saxon XSLT processor (see Fig. 24). Our XSLT optimization approach is faster for any input document larger than 2MB in comparison with the query transformation approach (see Figs. 22, 25).

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 363

At last, we apply Q to the transformed XML data.

We have already compared the execution times of the query transformation and XSLT optimization approaches with the common approach before. We compare the execution time of the query transformation approach with that of the XSLT optimization approach in Fig. 22 when using Xalan and in Fig. 25 when using Saxon. Our XSLT optimization approach is faster at any selectivity for le sizes bigger than 2MB when using Xalan and for any le size when using Saxon. When the le sizes and the selectivities of the queries become larger, our XSLT optimization approach becomes faster compared with the query transformation approach. In the experiments, our XSLT optimization approach is two times faster for 60% selectivity at a le size of 5MB when using Saxon, and is up to four times faster for 90% selectivity at a le size of 7MB. For Xalan, our XSLT optimization approach is up to three times faster.

5.2.2 Using SAX events as input and output

Depending on the approach used, we proceed in the following way in order to answer a given XPath query Q on the data, which is the result of a transformation according to an XSLT stylesheet S with input XML document D:

Common approach: the entire input stream of the XML document is the input of the

XSLT processor, which excutes the XSLT stylesheet in Fig. 17 on the input stream (see the lines for Q(S(D)) in Figs. 26, 27, 29, 30).

Query transformation: rst Q is reformulated to R according to the XSLT stylesheet

in Fig. 17 (which needs less than 20ms in our experiments). Then a SAX Loader (an extended approach of [32]) evalulates R on the input stream and computes the resultant XML fragment, which is the input of the XSLT processor. The XSLT processor executes the XSLT stylesheet in Fig. 17 (see Fig. 26 for using the Xalan XSLT processor and Fig. 29 for using the Saxon XSLT processor). In comparison with the common approach, the query transformation approach is faster for queries with selectivity less than or equal to 80% when using the Xalan XSLT processor (see Fig. 26), and is faster for queries with selectivity less than or equal to 50% when using the Saxon XSLT processor (see Fig. 29).

Fig. 26 Using Xalan with SAX and the query transformation approach

123

364 S. Groppe et al.

Fig. 27 Using Xalan with SAX and the XSLT optimization approach

Fig. 28 At which le size and which selectivity is our XSLT optimization approach faster than the query transformation approach when using Xalan with SAX?

XSLT optimization: rst the XSLT stylesheet of Fig. 17 is optimized (which needs

less than 30ms in our experiments) to the XSLT stylesheet of Fig. 19. The entire input stream of the XML document is the input of the XSLT processor. The XSLT processor executes the optimized XSLT stylesheet of Fig. 19 (see Fig. 27 for using the Xalan XSLT processor and Fig. 30 for using the Saxon XSLT processor). In comparison with the common approach, our XSLT optimization approach is faster for queries with selectivity less than or equal to 60% when using the Xalan XSLT processor (see Fig. 27), and is faster for queries with selectivity less than or equal 70% when using the Saxon XSLT processor (see Fig. 30).

We apply the SAX Loader (an extended approach of [32]) according to Q to the transformed XML data. There does not exist an out-of-memory error in any of these experiments at le sizes up to 25MB.

We have shown several comparisons of the query transformation and XSLT optimization. Figure 28 presents the areas in which our XSLT optimization approach is faster than the

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 365

Fig. 29 Using Saxon with SAX and the query transformation approach

Fig. 30 Using Saxon with SAX and the XSLT optimization approach

query transformation approach in terms of the selectivity of the queries and the le size of the input XML document when using Xalan and Fig. 31 when using Saxon. Our XSLT optimization approach is faster for higher selectivities at smaller le sizes. For example, our XSLT optimization approach is faster for more than 60% selectivity at a le size of 5MB and is faster for more than 80% selectivity at a le size of 15MB when using Xalan (see Fig. 28). When using Saxon, our XSLT optimization approach is faster for more than 32% selectivity at a le size of 5MB and is faster for more than 36% selectivity at a le size of 15MB (see Fig. 31).

Our XSLT optimization approach has signicant advantages over the query transformation approach when using DOM, as the query transformation approach copies all relevant information of the input data when using DOM, and the copying process is time-consuming. The step of copying all relevant information of the input data is not necessary when using our XSLT optimization approach, because additional lter expressions are directly incorporated into the optimized XSLT stylesheet. Therefore, our XSLT optimization approach has considerable advantages over query transformation when using DOM. When using SAX, query transformation uses the SAX Loader to lter the irrelevant information, and the ltering

123

366 S. Groppe et al.

Fig. 31 At which le size and which selectivity is our XSLT optimization approach faster than the query transformation approach when using Saxon with SAX?

Fig. 32 XSLT stylesheet used for the transformation of the DBLP data

is very efciently implemented. Thus, the XSLT optimization approach does not perform signicantly faster than the query transformation when using SAX.

5.3 The second experiment set using the DBLP data set

In this set of experiments, we have used the rst 7.7MB of the DBLP data [40] of the version of the 7 February 2005, which contains a bibliography of publications. We have used the XSLT stylesheet of Fig. 32 for the experiments.

We have varied the selectivity of the queries by using two queries Q1 and Q2 with parameter X, where X {1990, 1992, 1994, 1996, 1998, 2000, 2002, 2004, 2006}. The query

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 367

Q1 returns the years of those entries, which are below the given parameter X and the query Q2 additionally returns the whole entries. In more detail, we have used the query Q1=/child::result/child::entry[child::Year < X]/child::Yearfor a selectivity up to 11.8% and we have used the query Q2=/child::result/child::entry[child::Year < X]/child::* for a selectivity of more than 11.8 %. The experiments measure the execution time which each approach costs to perform the queries with different selectivity on the DBLP data.

The document node of the used DBLP data set has 33,929 child nodes, each of which has 10 child nodes on average. Each of these 10 child nodes has a half child node on average. There are altogether only 88 nodes at the fth level and only one element node at the 6th level in the used DBLP data set. These kinds of documents, where each node has many child nodes on average, are called at XML documents. In comparison to at XML documents, each node in deep XML documents has few children on average.

Figure 33 present the results for the common approach, the query transformation approach and the XSLT optimization approach when using the Xalan XSLT processor with DOM, and Fig. 34 when using SAX. Figure 35 presents the results when using the Saxon XSLT processor with DOM, and Fig. 36 presents the results when using the Saxon XSLT processor with SAX.

The SAX Loader used in the query transformation approach can efciently lter irrelevant XML fragments even in at XML documents within one single scan. However, the XSLT optimization approach often adds complex lter expressions into the optimized XSLT stylesheet. The evaluations of these complex lter expressions often have to navigate through all the children in the input XML document, which is slower if there are many children as at input XML documents. Therefore, the query transformation approach performs better than the XSLT optimization approach when using SAX (see Figs. 34, 36) for the DBLP data, which is a at XML document. The XSLT optimization approach performs better than the query transformation approach when using DOM (see Figs. 33, 35), as the XSLT optimization approach avoids copying relevant XML nodes, which is time consuming. We provide a further analysis of the experimental results in Sect. 5.5.

Fig. 33 Transforming 7.7 MB DBLP data with Xalan using DOM

123

368 S. Groppe et al.

Fig. 34 Transforming 7.7 MB DBLP data with Xalan using SAX

Fig. 35 Transforming 7.7 MB DBLP data with Saxon using DOM

5.4 The third experiment set using different XSLT stylesheets

We present the third experiment in order to show that the XSLT optimization approach speeds up the processing of XSLT stylesheets even if we use (a) another computer conguration and(b) various different XSLT stylesheets.

Our test system for the third experiment is an Intel Pentium 4M processor 2GHz with 1GB

RAM, Windows XP as operating system, Java VM build Version 1.5.0 and its included XSLT processor. We have chosen the XSLT stylesheets and data sets of the XSLTMark benchmark

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 369

Fig. 36 Transforming 7.7 MB DBLP data with Saxon using SAX

[13]. The designers of the XSLTMark benchmark choose these XSLT stylesheets in order to cover as many aspects of the XSLT language as possible.

Table 5 contains the execution times of the XSLT optimization approach versus the execution times of the common approach and the query transformation approach using the XSLT stylesheets of the XSLTMark benchmark, SAX, the given input XML documents and XPath queries. The experimental results show that the XSLT optimization approach is either comparable to or faster than the query transformation approach for all the cases in this experiment. Both approaches are ususally faster than the common approach. A further analysis of the experimental results is presented in Sect. 5.5.

5.5 Discussion of experimental results

After analyzing all experimental results, we can characterize which types of XSLT stylesheets and which types of queries can be better optimized with which approach, XSLT optimization or query transformation.

XSLT stylesheets with an extensive usage of built-in templates: Built-in templates

apply the templates of an XSLT stylesheet recursively with all child nodes of the current input XML node such that it is difcult for a static analysis to exclude those templates, which are not called by the considered built-in template. For these XSLT stylesheets, the query transformation approach cannot signicantly project the input XML document to a sufcient XML fragment. Although the XSLT optimization approach cannot take advantage of the called template analysis either, it can still benet from the elimination of unnecessary XSLT code, and thus the execution times are optimized (see Sect. 5.4). For example, using our XSLT optimization approach, the optimized XSLT stylesheet of patterns.xsl is processed 25times faster in comparison with using the query transformation approach (see Table 5).

123

370 S. Groppe et al.

Table5ExecutiontimesoftheXSLToptimizationapproachversustheexecutiontimesofthecommonapproachandthequerytransformationapproachusingtheXSLT

stylesheetsoftheXSLTMarkbenchmark

XSLTInputXMLXPathExecutionSizeofXSLTExecutionSizeofreducedExecution

stylesheetdocumentquerytimestylesheettimeQueryinputdatatimecommon

XSLTfromXSLTTransformationbyQueryTransf.approachin

OptimizationOptimizationdividedinsecondsapproachdividedseconds

insecondsbysizeoforiginalbysizeoforiginal

inputdatainputdata

alphabetize.xsldb10000.xml/table/row0.27868%0.2319%1.175

avts.xsldb10000.xml/table/address[attribute::id=0001]0.16476%0.359%0.501

backwards.xsldb10000.xml/table/row/id[.=0001]0.19892%0.1939%0.992

creation.xsldb10000.xml/table/row/column0.81479%1.08785%1.153

current.xslcurrent.xml/items/rst0.0253%0.05337%0.029

nd.xsldepth.xml/element0.02582%0.0231%0.031

functions.xsldb10000.xml//extra0.20132%3.4679%26.060

identity.xsldb10000.xml/table/row0.20993%0.55385%0.715

oddtemplate.xsloddtemplate.xml/TOP/A[.=1]0.02976%0.031100%0.029

patterns.xsldb10000.xml/dgnorm_document/aranow/0.02987%0.736100%0.732

rstname[.=Al]

stringsort.xsldb10000.xml/table/row0.28498%0.1619%0.728

summarize.xsldb10000.xml/summary/total0.16537%0.1220,001%0.3

total.xslchart.xml/html/table/tr0.02352%0.02810%0.028

union.xslunion.xml/TOP0.02566%0.0288%0.028

xpath.xslxpath.xml/TOP/FOO[.=foo1]0.032106%0.03100%0.028

xslbench1.xslxslbench1.xml/HTML/BODY/TABLE/TBODY/TR/TD/

MAP[@name=FrontPageMap0]0.0236%0.1286%0.05

xslbench2.xslxslbenchdream.xml/html/body/h10.03153%0.0280,003%0.073

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 371

XSLT stylesheets where the name of generated elements do not depend on the input

and queries that address only a fragment of the output: Here, the XSLT optimization approach can eliminate unnecessary XSLT instructions, and thus the XSLT optimization approach has considerable advantages over the query transformation approach. If the eliminated XSLT instructions would be executed only once for the given input data, then the achieved speed up is not signicant. However, if the eliminated XSLT instructions would be often executed because they are, for example, in a loop, then the achieved speed up is signicant in the experiments (see the experiments in Sect. 5.4).

Complex queries and complex input path expressions: The XSLT optimization

approach often adds complex lter expressions into the XSLT stylesheet for handling complex queries, and the complex input path expressions in the XSLT stylesheet slow down the processing of the XSLT stylesheet (see Sect. 5.4). Therefore, in the case of complex queries and complex input path expressions, the XSLT optimization approach has fewer advantages than the query transformation approach. Although complex given queries often lead to complex transformed queries by the query transformation approach, the SAX Loader is more efciently implemented even for complex queries than current XSLT processors when evaluating complex lter expressions. Therefore, the query transformation approach has an advantage in this case in comparison to the XSLT optimization approach.

Flat versus deep input XML documents: The XSLT optimization approach often adds

complex lter expressions to the optimized XSLT stylesheet. The evaluations of these complex lter expressions often have to navigate through all the children in the input XML document, which is slower if there are many children like in at input XML documents (see Sect. 5.4). In deep XML documents, element nodes usually do not have many children, such that the evaluation of complex lter expressions added to the optimized XSLT stylesheets by the XSLT optimization approach does not signicantly slow down the processing of the optimized XSLT stylesheets. Therefore, if no other characteristics of the used query and the XSLT stylesheet signicantly inuence the execution times, then when querying deep XML documents, the XSLT optimization approach performs better; and when querying at XML documents, the query optimization approach performs better.

DOM versus SAX in the input source: The query transformation approach is significantly faster when using SAX in the input source, as the SAX Loader is very efciently implemented. In case of using DOM in the input source, the query transformation approach copies relevant XML nodes in order to project to the relevant XML fragment, which is time-consuming, but the XSLT optimization approach does not need to copy relevant XML nodes and just works on the DOM representation of the whole document. Therefore, the XSLT optimization approach has considerable advantages over the query transformation approach when using DOM (see Sect. 5.2).

6 Related work

Query reformulation is known from relational databases. In comparison to our approach, applying the reformulated query transforms the retrieved data into the target format in one step as follows [12]:

Given two schemas Forig and Ftransf and a correspondence S between them, nd a query R formulated in terms of schema Forig that is equivalent to a given query Q formulated in terms of schema Ftransf modulo the correspondence S.

123

372 S. Groppe et al.

In our approach, we perform the transformation of XML data in a separate step in order to use standard XSL processors and standard XPath evaluators wherever possible. As in the approach of query reformulation in relational databases, our approach does not use a direct connection to a database for transforming the query.

Unlike the XML data model, the classical relational model and the deductive data model have no hierarchy, treat element order as insignicant, and do not support identity. These properties of the XML data model have to be considered in query reformulation for XML. Therefore, query reformulation approaches for other data models cannot be applied in pure XML environments, where only XML is used. In comparison to the approaches where XML data and XML queries are rst mapped to relational databases and relational queries [16,26,28,3739,41], we support whole XPath [45] and whole XSLT [44]. Furthermore, we can avoid performance bottleneck which occurs whenever complex mappings are processed, by using an XML query reformulation approach that uses only XML data and XML languages.

For the transformation of XML queries into queries based upon other data formats, at least two major research directions can be distinguished. First, the mapping of XML queries to other data storage formats of object-oriented or of relational databases (e.g., [7,12]), and second, the transformation of XML queries or XML documents into other XML queries or XML documents based on another XML data format (e.g., [1]). We follow the second approach; however, we focus on XSL [44] for the transformation of both data and XPath [45] queries.

In related contributions to schema integration, two approaches to data and query translation can be distinguished. While the majority of contributions (e.g., [2,10]) map the data to a unique representation, we follow [8] and map the queries to those domains where the data resides.

The contribution in [11] contains query reformulation according to path-to-path mappings. We go beyond this, as we use XSLT as a more powerful mapping language. The work reported in [33] describes how XSL processing can be incorporated into database engines, but focuses on efcient XSL processing. The complexity of XPath query evaluation on XML documents is examined in [19]. In comparison, we use an evaluation based on output nodes of XSLT and consider XSLT optimization according to a query. Diao et al. [14,15] present algorithms to lter an XML document according to a given query and analyze the performance, but the algorithms do not contain XSLT optimization according to a query. Wang et al. [42] describes an approach to compress XML data, where queries can be evaluated on the compressed XML data. For this purpose, the queries have to be reformulated in order to work on the compressed data.

The contributions [17,20,36] introduce an algebra for XQuery. Additionally, they list transformation and optimization rules based on the introduced algebra, but they do not contain an XSLT variant of the optimization approaches presented here.

The study in [9] describes how the language XQuery can be extended to support views. It describes the language extensions but does not describe how to optimize them.

Many applications use XSLT in order to render XML data into HTML web pages. Mukherjea [34] describes a special search engine for web pages, which signicantly reduces the number of pages, which need to be downloaded. It would be interesting to extend the approach of [34] for usage of XSLT views in order to propagate queries to the original data of web pages rather than their rendered versions.

The approach proposed in [32] projects XML documents to a sufcient XML fragment before processing XQuery queries. It contains a static path analysis of XQuery queries, which computes a set of projection paths formulated in XPath from an arbitrary XQuery

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 373

expression. In comparison to this approach we describe, among other things, a path analysis in XSLT stylesheets depending on an input XPath query. Furthermore, we analyze paths in recursive calls (of templates) and we optimize the XSLT stylesheet by excluding unnecessary XSLT instructions. We project the input of subexpressions (instead of projecting only the input of a whole XQuery expression as in [32]) by inserting predicates.

The work in [29] describes how to transform XQuery expressions into XSLT stylesheets. This approach can be used in order to extend our proposed approach to support XQuery as query language and as view language.

Based on the approaches presented in [30,35], our approach for XSLT optimization could be extended for XML document similarity computation and clustering of XML data through XML views expressed by XSLT stylesheets.

Our previous contributions reported in [2225] deal with the query transformation approach where we do not optimize the XSLT stylesheet itself. In [24], we have introduced the XSLT processor model, which is summarized in this paper, for the query transformation approach. Furthermore, our work in [24] also contains a simple modied XPath evaluator in pseudo code, which we have extended for the XSLT optimization approach by support of additional axes. We have adapted the general XPath evaluator to the requirements of our new XSLT optimization approach. In [24], we describe how to extend the query transformation approach in order to support XSLT queries, which can be also used to extend our new XSLT optimization approach for the support of XSLT queries.

In contrast to all these approaches, in this paper, we focus on the optimization of XSLT stylesheets according to XPath queries by modifying the XSLT stylesheet (instead of projecting the input data to a sufcient part). Furthermore, we describe a sketch of an XSLT optimization algorithm, which supports whole XPath and whole XSLT, and we present a proof of its correctness. In experiments, we compare the execution times of the XSLT optimization approach with the common approach which transforms rst the entire XML data, and we compare with the query transformation approach which does not modify the XSLT stylesheet, but projects the input data to a sufcient part.

7 Summary and conclusions

Whenever XML data D given in an XML format Forig can be transformed by an XSLT stylesheet S into an XML format Ftransf, and a query expressed in terms of format Ftransf

has to be applied, to execute the query efciently, we need to avoid replicas, reduce the processing costs for document transformation by an XSLT processor, and reduce data shipping costs in distributed scenarios. For those purposes, we have introduced the XSLT optimization for XPath queries approach (XSLT optimization) and have compared it with the query transformation approach and the common approach.

In the common approach, we rst transform the entire data D according to an XSLT stylesheet S into S(D) and query S(D) afterwards according to a given query Q. The nal result of the common approach is Q(S(D)).

In the query transformation approach [24], we transform a given query Q by using a given XSLT stylesheet S into a query R. R can be applied to the input XML document D in order to retrieve a smaller XML fragment R(D), which contains all the relevant data. R(D) can be transformed by the XSLT stylesheet into S(R(D)), from which the query Q selects the relevant data.

123

374 S. Groppe et al.

In the proposed XSLT optimization approach, we optimize a given XSLT view according to a given query Q. This optimized XSLT view can be applied to the input XML document D, the result of which contains all the relevant data to answer Q.

We have shown by experimental results that both approaches, the XSLT optimization approach and the query transformation approach, out-perform the common approach in many cases, particularly in the case for queries on large XML documents. We have shown that our approach is scalable and becomes more efcient for larger XML documents. The analysis about the advantages of both approaches, the query transformation approach or the XSLT optimization approach, for which types of input XML documents and for which types of XSLT stylesheets is complex. We have discussed the analysis in Sect. 5.5.

Whenever XML data is transformed according to an XSLT stylesheet, all three approaches(i.e., the common approach, the XSLT optimization approach and the query transformation approach) can be used to enable on-demand transformations. Especially, the query transformation approach and the XSLT optimization approach can be incorporated into XML databases (and into relational databases, which are enabled to generate XML data from the relational data) in an efcient and scalable manner. The XSLT optimization approach transforms only the necessary data to answer a given query as the query transformation approach, but additionally, the XSLT optimization approach does not execute unnecessary XSLT instructions to answer a given query. Thus, the XSLT optimization approach outperforms the query transformation approach in most cases of the real world.

It is promising to investigate whether a combination of the query transformation approach and the XSLT optimization approach has considerable advantages in some application scenarios. One might think about a distributed scenario, where the source data are stored on a different computer than the one on which a transformation unit is processed. We can then determine the needed fragment of the input data by using the query transformation approach. Thus, we can already project the input data to the fragment, which is sufciently needed by the transformation unit on the computer where the input data are located, and we only transfer this fragment of the input data over the network to the computer executing the transformation unit instead of all input data, which is more time consuming. The transformation unit afterward uses the XSLT optimization approach to optimize the execution of the XSLT stylesheet, as the XSLT optimization executes only the sufcient XSLT instructions to answer a given query on its output instead of all XSLT instructions (when using the query transformation approach), which is more time consuming.

8 Appendix

The Appendix consists of 4 subsections. Section 8.2 describes the search on output nodes (compare with Sect. 3) of a given XSLT stylesheet with full technical details, while Sect. 8.1 provides an XSLT output data model, which is used in the denition of the search. Section 8.3 introduces basic denitions and provides the theoretical background for the proof of correctness, which we present in Sect. 8.4, of our XSLT optimization approach.

8.1 XSLT output data model

Based-on the data models for the XSL language given by [44], we develop a model for the output data of XSLT stylesheets. We use this XSLT output data model in order to dene the search on output nodes of a given XSLT stylesheet with full technical details in Sect. 8.2.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 375

Denition 8 If the XSLT node N2 is a successor XSLT node of N1, we call N1 the predecessor XSLT node of N2.

An XSLT stylesheet is a set of nodes of type Node. There are four main specic Node types in an XML document, which are generated by output nodes of the XSLT stylesheet: root (type of the document node), iElem (type of element nodes), iAttr (type of attribute nodes) and iText (type of text nodes). Accordingly, we dene four functions of Node

Boolean to test the type of a node: isRoot, isiElem, isiAttr, and isiText, which return true if the generated XML node of the given XSLT node is a root node, is of type iElem, iAttr or iText respectively, otherwise these functions return false.

We use sequences to express the detected stylesheet paths and paths in the XSLT stylesheet. We denote < x1, . . ., xk > for a sequence of its contained elements x1, . . ., xk. We use the operator o to express the concatenation of sequences, i.e., < x1, . . ., xk > o < y1, . . ., ym > =< x1, . . ., xk, y1, . . ., ym >. Furthermore, we denote zn for an arbitrary repetition (including non-repetition) of nodes of a sequence z. We denote x[k] = xk for the k-th entry of

the sequence x. We use the type sequence(Node {f,b}) for the data structure to

represent paths in the XSLT stylesheet. We refer to the rst eld of an entry x[k] representing the node in the XSLT stylesheet with x[k].N and to the second eld representing the direction in the XSLT stylesheet with x[k].d, where f represents the forward direction and b the backward direction. Figure 37 denes relation functions with signature sequence(Node {f , b }) Set(sequence(Node {f,b})) and the helper

functions root (empty Node), last (sequence(Node {f,b}) Node

{f,b}), succe (Node Set(Node)), S (Node sequence(Node {f,b}))

and sp (sequence(Node) sequence(Node)). root(x) returns the root node of the

XSLT stylesheet in which x occurs; iChild (iAttr respectively) relates the last node of the sequence to its child nodes (attribute nodes, respectively). For computing iChild(x) and iAttr(x) an auxiliary function S(x) is dened, which relates the last node of the sequence x to its self node and all its successor nodes one node before an output node marking also loops. iDesc returns a sequence from the last XSLT node of the sequence x to all its descendant nodes marking also loops. The function sp(x) eliminates XSLT nodes with backward

last(<y>) if xk=yn, i.e. xk represents an arbitrary repetition of y (at least one repetition) last(<x1,,xk>) = x otherwiseroot() = { y | isRoot(y) and yS}, where S is the considered XSLT stylesheetsucce(x) = { y | y is a successor XSLT node of x}

S(x) = i=0 Si, where S = { <(y,f)> | ysucce(x) isiElem(y) isiAttr(y) }, Si={ y o <(z,f)> | ySi-1 zsucce(last(y).N) isiElem(z) isiAttr(z) (z,f)y}

{y o (k o <(z,f)>)n| y o k>Si-1zsucce(last(k).N)isiElem(z)isiAttr(z)(z,f)=last(y)} iChild(x) = { y o <(z,f)> | ((S(last(x).N)?{} yS(last(x).N) zsucce(last(y).N))

(S(last(x).N)={} y=<> zsucce(last(x).N))) ( isiElem(z) isiText(z) ) } iDesc(x) = i=0 C , where C0 = iChild(x), Ci = { y o z | yCi-1 z iChild(last(y)) last(z)y}

{ y o (k o <z>)n | y o k Ci-1 z iChild(last(y)) last(z)=last(y)} iAttr(x) = { y o <(z,f)> | ((S(last(x).N)?{} yS(last(x).N) zsucce(last(y).N))

(S(last(x).N)={} y=<> zsucce(last(x).N))) isiAttr(z)} iSibling(x) = { z o y | z iParent(x) yiChild(z)}iPS(x) = { y | yiSibling(x) last(y).N << last(x).N}iFS(x) = { y | yiSibling(x) last(x).N << last(y).N}iParent(<x1,,xk-1 >) = <(xk.N,b)> o iP(<x1,,xk-1>) if xkAn

iParent(<x1,,xk-1>) iParent(<x1,,xk-1,An,A>) if xk=An

iP (<x1,,xk-1,xk>)= { y | y=<(xk.N,b)> (isiElem(xk.N) xk.N=root())}

{ y | isiElem(xk.N) xk.N=root() y=<(xk,b)> o r riP(<x1,,xk-1>)} { y | xk=An y=iP(<x1,,xk-1,An,A>)}sp(x) = { <x1m1, , xkmk> | ximk=x[mk] ximk.d=f mi, where i<k:mi<mi+1(r>0: |{x[mi+j]|j<=r x[mi+j].d=b}|>r/2)} iAncestor(x) = i=0 Ai, where A0 = iParent(x), Ai = { y o z | yAi-1 ziParent(sp(x o y)) last(z)y}

Fig. 37 Used relation functions

123

376 S. Groppe et al.

direction in the sequence x and their corresponding XSLT nodes with forward direction. The function sp(x) is used for the computation of the sequence to the parent node by iParent(and its helper function iP) and of the sequence to the ancestor nodes by iAncestor, which detects and marks also loops. The relation function iSibling(x) relates to the sibling output nodes. iPS(x) (iFS(x), respectively) relates to the sibling output nodes that occur before (after, respectively) the last node of the sequence x in the document order. We write y << x to indicate that the node y occurs before the node x in the document order of the output. The document order is computed from an XSLT stylesheet in the following way: the document order of elements is the order in which they are dened except they are in the scope of iterations like <xsl:for-each> or <xsl:apply-templates>, where any element of this set of elements can occur before any other elements of this element set.

Let NodeTest be the type of the node test of XPath. An auxiliary function attr(x,name) retrieves the value of the attribute name of the node x. The function NT: Node

NodeTest Boolean, which tests an XSLT node against a node test of XPath, is dened

as: NT(x, *) = isiElem(x) isiAttr(x)

NT(x, label) = (isiElem(x) (attr(x, name)=label))

NT(x, text()) = isiText(x) (isiAttr(x) (attr(x, name)=

label)) NT(x, node()) = true

8.2 Altered XPath evaluator

However, our approach can be extended to full XPath so that in case of doubt, we can consider a larger fragment in the XSLT stylesheet to generate more output than needed. The XSLT optimization approach is still correct, as in the last step, the original XPath query is applied on the result of the XSLT stylesheet. This applies also to positional predicates of XPath expressions.

Our proposed search on output nodes evaluates an XPath query on an XSLT stylesheet, and computes a set of stylesheet paths to the possible output nodes specied by the XPath query.

In the following paragraphs, we describe the search on output nodes of the considered XSLT stylesheet, and dene the following notations. Let z be a pointer in a stylesheet path and d is a eld of a styleheet path record, we write z.d to refer to the eld d of the stylesheet path record to which the pointer z points.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 377

Figure 38 denes the semantics L of the search on XSLT stylesheets, where we initialize the search by evaluateXPathOnXSLT(Q,S):=L(Q, ), where Q is the given XPath query and S the given XSLT stylesheet. The function L takes an XPath expression and a stylesheet path as argument and yields a set of new stylesheet paths, and is dened recursively on the structure of XPath expressions. For evaluating each location step of an XPath expression, our search rst computes the axis and the node test of the location step by iteratively taking the last XSLT node of the sequence p[|p|].SN from each stylesheet path p in the path set as the context node. The path set is computed from the part XP of the XPath expression, which has been evaluated, i.e., XP=XP/XP and XP = XPf/XPr,

where XPf is the current location step and XPr is the subexpression of XP, which still has to be evaluated after the current location step XPf. For each resultant node r selected by the current location step XPf, a new stylesheet path is generated based on the old path p. We use the minus operator (-) in order to eliminate a starting sequence p from a sequence p, i.e., p-p=p in the case of p=p o p, otherwise p-p=p.

The stylesheet paths of a predicate are attached to the context node of the predicate. Attached stylesheet paths f1, . . . , fk are computed from a set of predicates q1, . . . , qk. fiis either a stylesheet path list computed from a predicate qi, or is the predicate expression qi itself when qi does not contain location steps. The node last(p[i].SN).N is the context node of these predicates. qi is evaluated to false if qi is computed to the empty set with the exception of not (qi), which is computed to true. For example, L(e [q1 and

q2],p) is computed to empty set if q1 or q2 are false.

L: XPath expression stylesheet path [ XPath expression] set(stylesheet path)

1. L(e1|e2,p) = L(e1,p) L(e2,p)2. L(/e,p) = L(e/child::text(),p1) p1=< (/e,<(root(),f)>,) > if L(e/child::text(),p1) =< (/e,<(root(),f)>,) > otherwise

3. L(a::n/e,p) = { p L(e,p1) p1L(a::n,p,a::n/e) }4. L(a::n[q1 ]/e,p) = { p2 | p2L(e,p1) p1L(a::n[q1][qn],p,a::n[q1][qn]/e) }5. L(a::n/e,p,XP) = { p | p2L(e,p1,XP) p1L(a::n,p,XP) }6. L(a::n[q1 ]/e,p,XP) = { p2 | p2L(e,p1,XP) p1L(a::n[q1][qn],p,XP) }7. L(child::n,p,XP) = {p o <(XP,SN,)>| SNiChild(p[|p|].SN)NT(last(SN),n)}8. L(self::n,p,XP) = { p | NT(last(p[|p|].SN).N, n) }9. L(desc::n,p,XP) = { p o <(XP,SN,)> | SNiDesc(p[|p|].SN)NT(last(SN).N,n)}10. L(parent::n,p,XP) = { p o <(XP,SN,)>| SNiParent(sp(p[1].SN o o p[|p|].SN))NT(last(SN).N,n)}11. L(ances::n,p,XP) = { p o<(XP,SN,)>|SNiAncestor(sp(p[1].SN oo p[|p|].SN))NT(last(SN).N,n)}

12. L(DoS::n,p,XP)= L(self::n,p,XP) L(desc::n,p,XP)13. L(AoS::n,p,XP) = L(self::n,p,XP) L(ances::n,p,XP)14. L(FS::n,p,XP) = { p o <(XP,SN,)> | SNiFS(sp(p[1].SN o o p[|p|].SN))NT(last(SN).N,n) }15. L(following::n,p,XP) = L(AoS::node()/FS::node()/DoS::n,p,XP)16. L(PS::n,p,XP) = { p o <(XP,SN,)>| SNiPS(sp(p[1].SN o o p[|p|].SN))NT(last(SN).N,n) }17. L(preceding::n,p,XP) = L(AoS::node()/PS ::node()/DoS ::n,p,XP)18. L(attr::n,p,XP) = { p o <(XP,SN,)>| SNiAttr(p[|p|].SN)NT(last(SN).N,n)}19. L(e[q],p,XP)={<p[1], , p[|p|-1]> o <(XP,SN,f {p | pL(q,p) p=p-p})>| pL(e,p,XP) p[|p|]=(XP,SN,f)}20. L(self::node()=C,p) = L(text()=C,p if L(text()=C,p) {p o <(self::node()=C)>} otherwise

21. L(e = C,p) = L(e[self::node()=C],p), where e?self::node()22. L(e[q1][qn],p,XP)={<p[1],,p[|p|-1]>o<(XP,SN,f{(and,{p | pL(q1,p) p=p-p},, {p | p L(qn,p) p=p-p})>})|pL(e,p,XP)p[|p|]=(XP,SN,f)}

23. L(e[q1 and q2],p,XP) = L(e[q1][q2],p,XP)24. L(e[q1 or q2],p,XP) = L(e[q1]|e[q2],p,XP)25. L(e[q1 = q2],p,XP) = {<p[1], , p[|p|-1]>o<(XP,SN,f {(=, {p | p L(q1,p) p=p-p},{p | p L(q2,p) p=p-p})})>| pL(e,p,XP) p[|p|]=(XP,SN,f)}

26. L(e[not(q)],p,XP) = {<p[1], , p[|p|-1]> o <(XP,SN,f {(not, {p | p L(q,p) p=p-p})})>| pL(e,p,XP)p[|p|]=(XP,SN,f)}

27. L(,p) = {p}

Fig. 38 Formulas for constructing stylesheet paths. e and ei represent relative XPath expressions, a represents an XPath axis, n a node test, q and qi represent qualiers and C a constant, i.e., a literal string or number

123

378 S. Groppe et al.

Example 9 The single computation steps of applying the XPath query given in Fig. 11 to the XSLT stylesheet of Fig. 6 when using the formulas of Fig. 38 are given in Table 6. The nal result is presented in Fig. 12.

8.3 XSLT processor model

In this section, we outline the execution of an XSLT stylesheet by an XSLT processor and dene the state of the XSLT processor. This work is useful for the following aspects:

to understand how an XSLT processor transforms XML documents, to dene a formal XSLT processor model, to develop an XSLT stylesheet optimizing algorithm, and to prove the correctness of the XSLT optimizing algorithm (see Sect. 8.4) based on the

formal XSLT processor model.

8.3.1 Algorithms for XSLT stylesheet execution

To simplify the XSLT processor model, we reduce many cases of XSLT instructions by rst transforming XSLT instructions into their long form, i.e., XML nodes, which represent the direct output of the XSLT stylesheet (e.g., <elem>), are transformed into XSLT instructions in long form for generating the output (e.g., <xsl:element name=elem>). Let us consider the case that an XSLT instruction I contains an attribute A, which is set to the value of one or more XPath expressions, i.e., in the case of an XPath expression A={XP}, where XP is an XPath expression. In order to keep the XSLT processor model simple, we assume that the XSLT processor evaluates XP as it would be a child node <xsl:value-ofselect=XP> of I. Furthermore, other attributes of I containing an XPath expression are also executed as they would be a child node of I. Figure 39 presents the pseudocode algorithms of an XSLT processor, which executes XSLT stylesheets.

The rst algorithm processXSLTStylesheet (see Fig. 39) initializes the XSLT processor by selecting the rst input XML node of the input XML document iXML and the rst XSLT node N of the XSLT stylesheet (line 6). N contains the initial template, which may be an explicit or a built-in template, with the highest priority for matching the document node. The XSLT processor processes this template by calling the algorithm processXSLTNode(line 8).

The second algorithm processXSLTNode (see Fig. 39, line 922) executes a given single XSLT node N. Further parameters of the algorithm processXSLTNode include the current XML node K in the input XML document iXML and the last generated XML node oin the output XML document for adding new output to the output XML document relative to o. The algorithm processXSLTNode contains a loop iterating through the input node set of N (see line 1822). The input node set of the XSLT node N is the result of evaluating the XPath expression stored in a select attribute with context node K (see line 16) or is a set containing only the current input XML node K (see line 17) in the case that a new input node set is not declared by a select attribute of N.

The method executeOneXSLTNode (line 20) with three parameters, the current XSLT node N, the input XML node K, and the current XML node in the output XML document o, is used to execute the current XSLT node N of the XSLT stylesheet S. Since [44] describes the execution semantics of the different types of XSLT nodes, we do not describe the implementation of the method executeOneXSLTNode here. Note that the execution of an XSLT node (<xsl:apply-templates>, <xsl:apply-imports>

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 379

Table6ComputationstepsofapplyingtheXPathquerygiveninFig.11totheXSLTstylesheetofFig.6byusingtheformulasofFig.39.TheidentiersXPiaretakenfrom

Fig.11,theidentiersSiaretakenfromFig.6andtheidentiersEiaretakenfromFig.12

2L(XP2/child::text(),<E1>)3{p 2|p2L(XP3/child::text(),p1)[logicalandtext]p1L(child::paper,<E1>,

XP2/child::text())}

4L(XP3/child::text(),<E1,E2>)3{p 2|p2L(XP4/child::text(),p1)[logicalandtext]p1L(child::tableOfContents,

<E1,E2>,XP3/child::text())}

7{<E1,E2,E3>}

6L(XP4/child::text(),<E1,E2,E3>)4{p 2|p2L(XP6/child::text(),p1)[logicalandtext]p1L(child::section[child::

text()=Introduction],<E1,E2,E3>,XP4/child::text())}

7L(child::section[child::text()=Introduction],<E1,

E2,E3>,XP4/child::text())

19{<p[1],,p[|p|-1]>o<(XP,SN,f{p|

pL(child::text()=Introduction,p)p=p-p})>|p

L(child::section,<E1,E2,E3>,XP4/child::text())

p[|p|]=(XP,SN,f)}

8L(child::section,<E1,E2,E3>,

XP4/child::text())

7<E1,E2,E3,E4:=(XP4/child::text(),<(S9,f),(S10,f),((S17,f),

(S10,f))n,(S11,f)>,)>

21L(child::text()[self::node()=Introduction],<E1,E2,E3,E4>)

10L(child::text()[self::node()=Introduction],

<E1,E2,E3,E4>)

4{p 2|p2L(,p1)[logicalandtext]p1L(child::text()[self::node()=Introduction],

<E1,E2,E3,E4>,child::text()[self::node()=Introduction])}

StepTobedeterminedRulenumberDeterminedresult

applied

1L(XP1,)2L(XP2/child::text(),<E1>)

3L(child::paper,<E1>,XP2/child::text())7{<E1,E2>}

5L(child::tableOfContents,<E1,E2>,

XP3/child::text())

9L(child::text()=Introduction,<E1,E2,

E3,E4>)

123

380 S. Groppe et al.

19{<p[1],,p[|p|-1]>o<(XP,SN,f{p|

pL(self::node()=Introduction,p)p=p-p})>|p

L(child::text(),<E1,E2,E3,E4>,

child::text()[self::node()=Introduction])

p[|p|]=(XP,SN,f)}

12L(child::text(),<E1,E2,E3,E4>,

(child::text()[self::node()=Introduction],<S16,f>,)>}

13L(self::node()=Introduction,{<E1,E2,E3,E4,

7{<E1,E2,E3,E4,

20{<E1,E2,E3,E4,(child::text()[self::node()=Introduction],

<S16,f>,),({self::node()=Introduction})>}

27{<E1,E2,E3,E4,(child::text()[self::node()=Introduction],

<S16,f>,{self::node()=Introduction})>}

18<E1,E2,E3,E4,E5>

17L(child::text(),<E1,E2,E3,E4,E5>)3{p 2|p2L(,p1)[logicalandtext]p1L(child::text(),<E1,E2,E3,E4,

E5>,child::text())}

StepTobedeterminedRulenumberappliedDeterminedresult

15L(attribute::pages/child::text(),<E1,E2,E3,E4>)3{p 2|p2L(child::text(),p1)[logicalandtext]p1L(attribute::pages,<E1,

E2,E3,E4>,attribute::pages/child::text())}

11L(child::text()[self::node()=Introduction],<E1,E2,E3,

E4>,child::text()[self::node()=Introduction])

(child::text()[self::node()=Introduction],<S16,f>,)>})

18L(child::text(),<E1,E2,E3,E4,E5>,child::text())7<E1,E2,E3,E4,E5,E6>

19L((,<E1,E2,E3,E4,E5,E6>)27<E1,E2,E3,E4,E5,E6>

child::text()[self::node()=Introduction])

(child::text()[self::node()=Introduction],<S16,f>,

{self::node()=Introduction})>)

14L((,<E1,E2,E3,E4,

16L(attribute::pages,<E1,E2,E3,E4>,

attribute::pages/child::text())

Table6continued

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 381

Fig. 39 Algorithms to transform an input XML document iXML according to an XSLT stylesheet S into an output XML document oXML

or <xsl:call-template>) can cause the execution of a whole template. If the current XSLT node N generates an output XML node oz, the method executeOneXSLTNodealso attaches oz as a child node of the last node o in the output XML document. The method executeOneXSLTNode then returns the XML node oz in the output XML document (see Fig. 39, line 21), which serves as the last XML node in the output XML document in line 22. Here, the XSLT processor processes all child nodes DN of the current XSLT node N recursively by calling the method processXSLTNode with the parameters child node DN, current input XML node K, input XML document iXML and output XML node oz.

8.3.2 State of an XSLT processor

Considering the execution algorithm of an XSLT processor (see Fig. 39), we can dene the state of the XSLT processor as follows.

Denition 9 The state X of an XSLT processor that transforms an input XML document Daccording to an XSLT stylesheet S is a tuple (N, XP, K, I), where

N is the currently executed XSLT node of S; XP is a previous state, which represents the execution of the predecessor XSLT node of

N. If there are several of these previous states, then the last of them is chosen. We call XPthe predecessor state of X and dene a function predecessor(X), which returns the predecessor state XP (or null if X is the rst state);

K is the current input XML node; I is the input XML node set of the current XSLT node N with those XML nodes marked,

which are already processed.

Denition 10 Given an input XML document D and an XSLT stylesheet S, the execution sequence Z for D is the sequence of states of an XSLT processor while transforming Daccording to S.

123

382 S. Groppe et al.

Example 10 Figure 40 contains the execution sequence <X1, , X45> of the XSLT processor while the XSLT processor executes the XSLT stylesheet S in Fig. 6 on the XML document D in Fig. 5. In Fig. 40, we use the node identiers specied in Figs. 6 and 5.

Example 11 Figure 41 contains the execution sequence <X1, X2, X3, X4, X5,X10, X11, X12, X12a, X13, X14, X15, X18 , X19> of the states of the XSLT processor while executing the optimized XSLT stylesheet S in Fig. 9 on the XML document D in Fig. 5. In Fig. 14, we use the node identiers given in Figs. 9 and 5. A state Xi (except X12a) corresponds to the state Xi in Fig. 40. Note that the result of applying the query Q in Fig. 4 to the output of the execution sequence in Fig. 41 is the same as the result of applying the query Q in Fig. 4 applied to the output of the execution sequence in Fig. 40.

8.4 Proof of correctness

In this section, we prove the correctness of the XSLT optimization algorithm. More precisely, we prove that using our approach for XSLT optimization, which transforms a given XSLT stylesheet S into an optimized XSLT stylesheet S according to an XPath query Q, we retrieve the same result for Q(S(D)) and for Q(S(D)) for all XML documents D.

We prove the correctness by comparing the execution sequences of the XSLT processor with and without using our XSLT optimization approach. We check whether all relevant output is still generated when using our XSLT optimization approach and no other output

X1=((S1), -, /, {/})

X2=((B1), X1, /, {/})

X3=((B2), X2, (D1), {(D1)})

X4=((S2), X3, (D1), {(D1)})

X5=((S3), X4, (D1), {(D1)})

X6=((S4), X5,

(D1), {(D1)})

X8=((S6),X5,

(D1), {(D1)})

X10=((S8), X5, (D1), {(D1)})

X7=((S5), X6,

(D1,@label),

{(D1,@label)})

X9=((S7), X8,

X11=((S9),

X10, (D3),

{(D3), (D6), (D7)})

X28=((S9), X10, (D6),

{(D3), (D6), (D7)})

X37=((S9), X10, (D7),

{(D3), (D6), (D7)})

(D1,@writer),{

(D1,@writer)})

X29=((S10), X28, (D6), {(D6)})

X38=((S10), X37, (D7), {(D7)})

X12=((S10), X11, (D3), {(D3)})

X30=((S11),

X29, (D6),

{(D6)})

X36=((S17), X39=((S11), X38,

X45=((S17),

X38, -, {})

X13=((S11),

X12, (D3),

{(D3)})

X19=((S17),

X12, (D5),

{(D5)})

X29, -, {}) (D7), {(D7)})

X14=((S12),

X13, (D3),

{(D3)})

X16=((S14),

X13, (D3),

{(D3)})

X18=((S16), X20=((S10), X31=((S12), X33=((S14),

X35=((S16),

X30,

(D6,@label),{

(D6,@label)})

X40=((S12),

X39, (D7),

{(D7)})

X42=((S14), X44=((S16),

X13,

X19, (D5),

{(D5)})

X30, (D6),

{(D6)})

X30, (D6),

X39, (D7), X39, (D7,@label),

(D3,@label),

{(D6)})

{(D7)}) {(D7,@label)})

{(D3,@label)})

X32=((S13),

X31,

(D6,@size),{

(D6,@size)})

X34=((S15),

X33, (D6),

{(D6)})

X15=((S13),

X14,

(D3,@size),{

(D3,@size)})

X17=((S15), X16, (D3), {(D3)})

X41=((S13),

X40,

(D7,@size),

{(D7,@size)})

X43=((S15), X42,

(D7), {(D7)})

X21=((S11),

X20, (D5), {(D5)})

X27=((S17), X20, -,{})

X22=((S12),

X21, (D5),

{(D5)})

X24=((S14),

X21, (D5),

{(D5)})

X26=((S16), X21, (D5,@label),

{(D5,@label)})

X23=((S13),

X22,

(D5,@size),

{(D5,@size)})

X25=((S15), X24, (D5), {(D5)})

Xa is predecessor state of Xb

Fig. 40 Execution sequence <X1, . . . ,X45> of states of the XSLT processor when transforming the input XML document D in Fig. 5 according to the XSLT stylesheet S in Fig. 6

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 383

Fig. 41 Execution sequence <X1 , X2 , X3 , X4 , X5 , X10 , X11 , X12 , X12a , X13 , X14 , X15 , X18 , X19 > of the states of the XSLT processor when transforming the input XML document D in Fig. 5 according to the optimized XSLT stylesheet S in Fig. 9

nodes are generated when using our XSLT optimization approach. Our proof of correctness demonstrates in general how the correctnesses of such kind of optimization approaches can be proven by this example.

We let P be the set of paths, which contains all detected stylesheet paths and their branched paths computed by the search for output nodes described in Sect. 3.

Denition 11 Given a state x of an XSLT processor. We dene Predtrans(x) to be the set of states, which contains x and the transitive closure of the predecessor states of x, i.e., let P0(x) be {x} and let Pi(x) be predecessor(Pi1(x)) for i>0, then Predtrans(x) = i=0Pi(x).

Denition 12 Let Z be an execution sequence according to Denition 10. Then we dene the result of the function, excludeTestNodes(Z), to be an execution sequence Z of states, which contains all states of Z in the same order, except the states, the XSLT node of which is an inserted test node described in Sect. 4.4. Furthermore, the predecessor states XPof those states x = (N, XP, K, I) Z , where the predecessor states XP are excluded from

Z, are replaced with the predecessor state of the excluded state. This means that we set xto be (N, predecessor(XP), K, I) until the predecessor state is in Z.

Denition 13 Given an execution sequence Z of states for a given XML document. Let Z be the result of excludeTestNodes(Z). We denote x PZ , if p P so that the

following two conditions hold:(1) The state x Z of an XSLT processor and its predecessor states contain those executed

XSLT nodes that are the visited XSLT nodes of the detected stylesheet path p or its branched paths in the same order as in p,

X1 = ((S1), -, /, {/})

X2 = ((B1), X1, /, {/})

X3 = ((B2), X2, (D1), {(D1)})

X4 = ((S2), X3, (D1), {(D1)})

X5 = ((S3), X4, (D1), {(D1)})

X10 = ((S8), X5, (D1), {(D1)})

X11 = ((S9), X10, (D3), {(D3)})

X12 = ((S10), X11, (D3), {(D3)})

X12a = ((S10a), X12, (D3), {(D3)})

X19 = ((S17), X12, -, {})

X13 = ((S11), X12a, (D3), {(D3)})

X14 = ((S12), X13, (D3), {(D3)})

X18 =((S16), X13, (D3,@label), {(D3,@label)})

X15 = ((S13), X14, (D3,@size), {(D3,@size)})

Xa is predecessor state of Xb

123

384 S. Groppe et al.

(2) x Z : x Predtrans(x) : x Predtrans(x ) and x and its predecessor

states contain those executed XSLT nodes that are the visited XSLT nodes of the whole detected stylesheet path p in the same order.

We denote x PZ for the case that x is relevant for generating output that is relevant to

answer Q. We demand the requirement 2) because x can be on a branched path of p and is not necessarily directly on p. Furthermore, only in the case that a whole detected stylesheet path is executed (in x), the output that is relevant to answer Q is generated. Therefore, we only consider situations where a whole detected stylesheet path is executed.

Denition 14 Given an execution sequence Z of states for a given XML document and given x1, x2 Z, then x1 Z x2 denotes an order relation so that x1 is before the state x2 or in

the same place (i.e., x1=x2) in the execution sequence Z. Note that the order Z is different

from the predecessor relation (see Denition 9) between states of Z.

Denition 15 Given an execution sequence Z of states, then Z is a subsequence of Z iff

x Z : x Z and x1, x2 Z : x1 Z x2 <=> x1 Z x2.

Denition 16 Given an execution sequence Z of states and a state x1 Z, we dene the

distance of x1, which we denote by dist(x1), to the rst state in Z by counting the number of predecessor states until the rst state.

Example 12 In Fig. 40, the rst state is X1 and the distance of X37, i.e., dist(X37), is 6.

In the proof of correctness of the XSLT optimization algorithm, we use Lemmas 15 dened below. In these lemmas, we assume that an XSLT stylesheet S and an XPath query Q are given. Furthermore, the optimized XSLT stylesheet S is the result of the XSLT optimization algorithm with the input S and Q.

Denition 17 Let x1 and x2 be two states of an XSLT processor, let x1 = (N, XP1, K, I1)

and let x2 = (N, XP2, K, I2). Then we denote x1=* x2 iff the following two conditions

are true: 1) I1 I2 and 2) x1 and x2 are the rst states of the execution sequences or

XP1 = XP2.

We use x1=* x2 in order to express corresponding states in different execution sequences.

Denition 18 Let Z1 be an execution sequence and let Z2 be an execution sequence. We denote Z1 Z2 iff x1 Z1 : x2 Z2 : x1 = x2.

We use Z1 Z2 for the case that Z1 contains only states which have corresponding

states in Z2. Note that we do not consider the order of states in Z1 and Z2 for Z1 Z2.

Let ZS be the execution sequence of states of the XSLT processor while transforming the input XML document D according to the XSLT stylesheet S into S(D). Let ZS be the execution sequence of states of the XSLT processor while transforming the input XML document D according to the optimized XSLT stylesheet S into S (D). Furthermore, let ZSeTN be the result of excludeTestNodes(ZS ).

Lemma 1 ZSeTN ZS.

Proof of Lemma 1 Ignoring the inserted test nodes (see Sect. 4.4), S contains only XSLT nodes of S. Furthermore, the inserted lters in input nodes (see Sect. 4.3) and the inserted test nodes (see Sect. 4.4) restrict retrieved node sets. Therefore, all states in ZSeTN (with maybe a more restricted input XML node set) are also in ZS (but maybe not all states in ZSare also in ZSeTN ).

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 385

The basic idea of the lemmas, Lemmas 25, is to prove the following condition E for the subsequences in ZS and for the corresponding subsequences in ZSeTN .

Denition 19 Let subZS be a subsequence of ZS and let subZSeTN be a subsequence of ZSeTN . The condition E(subZS,subZSeTN ) is true for subZS and subZSeTN iff

x1, x2 subZS the following holds: x1 subZS x2, x1 PZS and x2 PZS <=> x 1, x 2 subZSeTN: x1=* x1, x2=* x2, x1 subZSeTN x 2, x 1

PZSeTN and x 2 PZSeTN .

Finally, we prove E(ZS,ZSeTN) in Lemma 5, which is used in Proposition 3 for the proof of correctness of the XSLT optimization algorithm.

Lemma 2 Let xLast ZS be a state where xLast PZS. Then x Last ZSeTN : x Last =

xLast and x Last P ZSeTN.

Proof of Lemma 2 The combination steps of ipe of a single XSLT node on a detected stylesheet path (or its branched paths) are designed so that the evaluation of the whole ipeas Boolean expression returns true if the current input node is relevant to answer Q. The constructed ipes are used to restrict the input node sets (see Sect. 4.3) and are used for the insertion of test nodes (see Sect. 4.4). All XSLT nodes on detected stylesheet paths are also available in S . Thus, if xLast PZS, then x Last ZSeTN : x Last = xLast and

x Last PZSeTN . Lemma 3 Let xLast ZS be a state where xLast PZS. Let xLast ZSeTN : xLast =

xLast and xLast PZSeTN that exists (see Lemma 2). We dene LSline20 in the following

way:Let LSline20 be a subsequence of ZS where LSline20 =< x|x ZS predecessor

(x) = xLast x is generated by line (20) of state x Last >.

We also analogously dene the subsequences of ZSeTN and the states generated in the function call of line (22):

Let LSeTN line21 be a subsequence of ZSeTN where LSeTN line21 =< x|x ZSeTN

predecessor(x) = xLast x is generated by line (20) of state x Last >.

Let LSline22 be a subsequence of ZS where LSline22 =< x|x ZS predecessor

(x) = xLast x is generated by line (22) of state xLast >.

Let LSeTN line23 be a subsequence of ZSeTN where LSeTN line23 =< x|x ZSeTN

predecessor(x) = xLast x is generated by line (22) of state xLast >.

Then E(LSline20, LSeTN line21) and E(LSline22, LSeTN line23) hold.

Proof of Lemma 3 We conclude E(LSline20, LSeTN line21) and E(LSline22, LSeTN line23) by considering the algorithm, processXSLTNode (see Fig. 39). Note that the combination steps of ipe of a single XSLT node on a detected stylesheet path (or its branched paths) for the computation of the lter to restrict the input node sets (see Sect. 4.3) and for the insertion of test nodes (see Sect. 4.4) are designed so that the following holds. Let W be the XML node set which is retrieved for S by the internal XPath evaluator in line (16) of Fig. 39 for the states in LSline20 LSline22. Let WS W be the XML node set which

is retrieved for S by the internal XPath evaluator in line (16) of Fig. 39 for an arbitrary state in LSeTN line21 LSeTN line23. For all states x1 LSline20 LSline22, where

x1 PZS, the current context node K of x1 is also in WS .K WS because we restrict the

retrieved node set in this way (see Sect. 4.3) and inserted test nodes tn (see Sect. 4.4). The test of tn is true for the current context nodes, the further processing of which leads to an execution of a whole detected stylesheet path. Therefore, x1 ZSeTN : x1 = x1. Furthermore,

123

386 S. Groppe et al.

x Predtrans(x1) : x2 PZS : (x Predtrans(x2) x2 ZSeTN : x2 =

x2) as we designed the XSLT optimization algorithm so that the following two conditions hold. First, all XSLT nodes of P are also available in S. Second, the inserted test nodes in S (see Sect. 4.4) and inserted lters in the input nodes of S (see Sect. 4.3) are computed in this way that these test nodes and lters only exclude current input nodes that are not relevant to answer Q. Therefore, x2 PZSeTN . Considering also Lemma 1, we prove Lemma 3.

Lemma 4 Let SZSold be a subsequence of ZS, let SZSeTN old be a subsequence of ZSeTNwhere E(SZSold, SZSeTN old) holds. Let xLast SZSold be a state where xLast PZS.

As E(SZSold, SZSeTN old) holds, xLast ZSeTN old : xLast = xLast and xLast

PZSeTN. Furthermore, we require the following to hold: xZSold :predecessor(x) =

xLast and x ZSeTN old : predecessor(x ) = xLast .

We dene SZSnew to be a subsequence of ZS, where SZSnew =< x|x SZSold (x ZS predecessor(x) = xLast) >, and SZSeTN new to be a subsequence of ZSeTN,

where SZSeTN new =< x|x SZSeTN old (x ZSeTN predecessor(x) =

xLast ) >.Then condition E(SZSnew, SZSeTN new) holds

Proof of Lemma 4 SZSnew can be divided into four subsequences, SZSnew1, LSline20, LSline22 and SZSnew2 of SZSnew so that SZSnew = SZSnew1 LSline20 LSline22

SZSnew2, where denotes the concatenation of sequences:

(1) SZSnew1 consists of the states from the rst state in SZSold to the state xLast including xLast itself.

(2) After xLast, the states of LSline20 occur in SZSnew.(3) After the states of LSline20, the states of LSline22 occur in SZSnew.(4) Let xLast+1 be the state after xLast in SZSold. The fourth subsequence SZSnew2

contains xLast+1 and all states after xLast+1 of SZSold.

Considering the denitions of the execution sequences and the algorithm processXSLTNode in Fig. 39, rst all states of SZSnew1 occur in SZSnew, then all states of LSline20, all states of LSline22 and nally all states of SZSnew2 occur in SZSnew. There are no other states in SZSnew. We can dene subsequences of SZSeTN new depending on xLast and xLast+1 where xLast+1 is the state after xLast in SZSeTN new, analogously, so

that rst all states of SZSeTN new1 occur in SZSeTN new, then all states of LSeTN line21, all states of LSeTN line23 and nally all states of SZSeTN new2 occur in SZSeTN new. There are no other states in SZSeTN new.

Considering the presumption of Lemma 4 and considering Lemma 3, the condition E(x,y) holds for each single subsequence x in SZSnew and its corresponding subsequence y in SZSeTN new, i.e., E(SZSnew1, SZSeTN new1), E(LSline20, LSeTN line21), E(LSline22, LSeTN line23) and E(SZSnew2, SZSeTN new2) hold. We now have to prove that E(SZSnew, SZSeTN new) holds. For this purpose, we arbitrarily choose two states x1 and x2 in SZSnew, where x1 PZS and x2 PZS. Then x1 = x2 SZSeTN new, x1 = x1,

x2 = x2, x1 PZSeTN and x2 PZSeTN because this is a requirement of the

condition E and E(SZSnew1, SZSeTN new1), E(LSline20, LSeTN line21), E(LSline22, LSeTN line23) and E(SZSnew2, SZSeTN new2) hold.

We consider two different cases for x1 and x2, x1 and x2.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 387

(a) If x1,x2 are both elements of the same subsequence, SZSnew1, LSline20, LSline22

and SZSnew2 of SZSnew and x1, x2 are both elements of its corresponding subsequence SZSeTN new1, LSeTN line21, LSeTN line23 and SZSeTN new2 of SZSeTN new, then x1 SZSnew x2, x1 PZS and x2 PZS <=> x1 SZSeTNnew

x2 , x1 = x1, x2 = x2, x1 PZSeTN and x2 PZSeTN .

(b) x1, x2 are elements of different subsequences, SZSnew1, LSline20, LSline22 and SZSnew2 of SZSnew and x1, x2 are elements of their corresponding subsequences SZSeTN new1, LSeTN line21, LSeTN line22 and SZSeTN new2 of SZSeTN new. Let x* be the rst state of LSline23 and let x* be the rst state of LSeTN line23. Let us rst consider the case that x1 SZSnew1 and x2 is an element of LSline20, LSline22

or SZSnew2. Then x1 SZSnew xLast SZSnew x2 and because of the transitive

property of SZSnew, x1 SZSnew x2. Furthermore, x1 SZSeTN new xLast SZSeTN new x2 and because of the transitive property of SZSeTN new, x1 SZSeTN new x2 . For all cases, we set analogously x1 and x2 into SZSnew relation

to xLast, x and xLast+1 and x1 and x2 into SZSeTN new relation to xLast , x and

xLast+1 . We use the transitive property of the order relations SZSnew and SZSeTN new,

and use the relations xLast SZSnew x SZSnew xLast+1 and xLast SZSeTN new

x SZSeTN new xLast+1 to conclude x1 SZSnew x2, x1 PZS and x2 PZS

<=> x1 SZSeTN new x2 , x1 = x1, x2 = x2, x1 PZSeTN and x2 PZSeTN . Therefore, condition E(SZSnew, SZSeTN new) holds Lemma 5 Condition E(ZSeTN,ZS) holds.

Proof of Lemma 5 If there is no detected stylesheet path, then Lemma 5 is true. Otherwise, there is at least one detected stylesheet path. Let ZS(n) and ZSeTN(n) be the subsequence which contains all states with at most the distance n to xStart in ZS and in ZSeTN, respectively. In other words:

Let ZS(n) be a subsequence of ZS where ZS(n) =< x|x ZS, dist(x) <= n >.

Let ZSeTN(n) be a subsequence of ZSeTN where ZSeTN (n) =< x|x ZSeTN ,

dist(x) <= n >.

Then we prove E(ZS(n),ZSeTN(n)) (see Denition 19) by induction on variable n. As ZS and ZSeTN are limited in the number of states, we prove Lemma 5 by the following induction proof:

Base clause (n = 0): The rst state xStart in ZS and also in ZSeTN is always

(N =< xsl : stylesheet >, XP = , K = /, I = {/} with / marked). Therefore,

ZS(0) = ZSeTN (0) =< xStart >, where xStart = (N =< xsl : stylesheet >,

XP = , K = /, I = {/} with / marked ) . xStart PZS and xStart PZSeTN, as detected

stylesheet paths always start with the XSLT node < xsl : stylesheet >. Furthermore,

xStart ZS xStart and xStart ZSeTN xStart so that E(ZS(0),ZSeTN(0)) holds.

In order to prove the recursion clause, we assume that E(ZS(n),ZSeTN(n)) holds. At rst, we set SZSold(n) = ZS(n) and SZSeTN old(n) = ZSeTN (n). Then, we iterate

through the set ZLast = {xLast ZS | xLast PZS dist(xLast) = n}. Let xLast be

the currently considered state of ZLast. Then xLast ZSeTN : xLast = xLast and

xLast PZSeTN (see Lemma 2). Let SZSnew(n) be a subsequence of ZS, where

SZSnew(n) = <xZS|xSZSold(n) predecessor(x)=xLast >.

Let SZSeTNnew(n) be a subsequence of ZSeTN, whereSZSeTN new(n) =< x ZSeTN |x SZSeTN old(n) predecessor(x) =

xLast >.We prove E(SZSnew(n), SZSeTN new(n)) by using Lemma 4.

123

388 S. Groppe et al.

After proving the condition E(SZSnew(n), SZSeTN new(n)), we set SZSold(n) to SZSnew(n), SZSeTN old(n) to SZSeTN new(n) and consider the next state of ZLast until all states of ZLast are considered and E(ZS(n+1),ZSeTN(n+1)) is proved.

Thus, condition E(ZS,ZSeTN) holds.

Proposition 3 Let S be an XSLT stylesheet and let Q be an XPath query. Then the XSLT optimization algorithm transforms S into an optimized XSLT stylesheet S so that for every input XML document D the following holds: Q(S(D)) returns the same result as Q(S(D)).

Proof of Proposition 3 We compute the detected stylesheet paths of Q in S. Then we can conclude from Lemmas 1 to 5: there is no other output generated in ZS, which is not generated in ZSeTN, but is relevant to answer the query Q. ZS is identical to ZSeTN ignoring that ZS contains additionally states executing inserted test nodes, which do not generate any output. Thus, there is also no other output generated in ZS. Therefore, Q(S(D)) returns the same result as Q(S(D)) for every XML document D.

References

1. Abiteboul S (1999) On views and XML. In: Proceedings of the eighteenth ACM SIGACT-SIGMODSIGART symposium on principles of database systems (PODS), Philadelphia, Pennsylvania

2. Abiteboul S, Cluet S, Milo T (1997) Correspondence and translation for heterogeneous data. In: Proceedings of the 6th international conference on database theory (ICDT), Delphi, Greece

3. Apache Software Foundation (2003) Xalan-Java. http://xml.apache.org/xalan-j/index.html

Web End =http://xml.apache.org/xalan-j/index.html 4. Apache Software Foundation (2003) Xerces2 Java Parser 2.5.0 Release. http://xml.apache.org/xerces2-j

Web End =http://xml.apache.org/xerces2-j 5. Bttcher S, Steinmetz R (2004) Optimized Internet search based on an intersection test for XPath expressions under a DTD, In: Proceedings of the international conference on internet computing (IC), Las Vegas, USA

6. Bttcher S, Trling A (2003) Checking XPath expressions for synchronization, access control and reuse of query results on mobile clients. In: Proceedings of the workshop on database mechanisms for mobile applications, Karlsruhe, Germany

7. Bourret R, Bornhvd C, Buchmann, AP (2000) A generic load/extract utility for data transfer between XML documents and relational databases. In: Proceedings of the 2nd international workshop on advanced issues of EC and Web-based information systems (WECWIS), San Jose

8. Chang CCK, Garcia-Molina H (2000) Approximate query translation across heterogeneous information sources. In: Proceedings of the 26th international conference on very large data bases (VLDB), Cairo, Egypt

9. Chen YB, Ling TW, Lee ML (2002) Designing valid XML views. In: Proceedings of the 21st international conference on conceptual modeling (ER), Tampere, Finland

10. Cluet S, Delobel C, Simon J, Smaga K (1998) Your mediators need data conversion! In: Proceedings ACM SIGMOD international conference on management of data, Seattle, Washington, USA

11. Cluet S, Veltri P, Vodislav D (2001) Views in a large scale XML repository. In: Proceedings of 27th international conference on very large data bases (VLDB), Roma, Italy

12. Deutsch A, Tannen V (2003) Reformulation of XML queries and constraints. In: Proceedings of the 9th international conference on database theory (ICDT) 2003, Siena, Italy

13. Developer (2005) XSLT Mark Version 2.1.0. http://www.datapower.com/xmldev/xsltmark.html

Web End =http://www.datapower.com/xmldev/xsltmark.html 14. Diao, Y, Altinel, M, Franklin, MJ, Zhang, H, Fischer, P: Path sharing and predicate evaluation for high-performance XML ltering. ACM Trans Database Syst 28 (4), 467516 (2003)

15. Diao Y, Rizvi S, Franklin MJ (2004) Towards an Internet-Scale XML dissemination service. In: Proceedings of the thirtieth international conference on very large data bases (VLDB), Toronto, Canada

16. Fernndez, M, Kadiyska, Y, Suciu, D, Morishima, A, Tan, WC: SilkRoute, a framework for publishing relational data in XML. ACM Trans Database Syst 27 (4), 438493 (2002)

17. Fisher D, Lam F, Wong RK (2004) Algebraic transformation and optimization for XQuery. In: Proceedings of the 6th Asia-Pacic Web conference (APWeb), Hangzhou, China

18. Franceschet M (2005) XPathMarkan XPath benchmark for the XMark generated data. In: Proceedings of the third international XML database symposium (XSym 2005), Trondheim, Norway

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 389

19. Gottlob G, Koch C, Pichler R (2003) The complexity of XPath query evaluation. In: Proceedings of the 22th ACM SIGMOD-SIGACT-SIGART symposium of principles of database systems (PODS), San Diego, California, USA

20. Grinev M, Kuznetsov S (2002) Towards an exhaustive set of rewriting rules for XQuery optimisation: BizQuery experience. In: Proceedings of the 6th East European conference on advances in databases and information systems (ADBIS), Bratislava, Slovakia

21. Grinev M, Pleshachkov P (2005) Rewriting-based Optimization for XQuery transformational queries. In: Proceedings of the 9th international database engineering and applications symposium (IDEAS 2005), Montreal, Canada

22. Groppe S, Bttcher S (2003) XPath query transformation based on XSLT stylesheets. In: Proceedings of the fth international workshop on web information and data management (WIDM), New Orleans, Louisiana, USA

23. Groppe S, Bttcher S, Birkenheuer G (2004) Efcient querying of transformed XML Documents. In: Proceedings of the 6th international conference on enterprise information systems (ICEIS), Porto, Portugal

24. Groppe, S, Bttcher, S, Birkenheuer, G, Hing, A: Reformulating XPath queries and XSLT queries on XSLT views. Data Knowl Eng J (DKE) 57 (1), 64110 (2006)

25. Groppe S, Bttcher S, Heckel R, Birkenheuer G (2004) Using XSLT stylesheets to transform XPath queries. In: Proceedings of the eighth East-European conference on advances in databases and information systems (ADBIS), Budapest, Hungary

26. Jain S, Mahajan R, Suciu D (2002) Translating XSLT programs to efcient SQL queries. In: Proceedings of the eleventh international world wide web conference (WWW2002), Honolulu, Hawaii, USA

27. Kay MH (2004) Saxon - The XSLT and XQuery Processor. http://saxon.sourceforge.net

Web End =http://saxon.sourceforge.net 28. Krishnamurthy R, Kaushik R, Naughton JF (2004) Efcient XML-to-SQL query translation: where to add the intelligence? In: Proceedings of the thirtieth international conference on very large data bases (VLDB), Toronto, Canada

29. Lechner S, Preuner G, Schre M (2001) Translating XQuery into XSLT, In: ER Workshops, Yokohama, Japan

30. Leung, Hp, Chung, KFL, Chan, SCf: On the use of hierarchical information in sequential mining-based XML document similarity computation. Knowl Inf Syst 7 (4), 476498 (2005)

31. Megginson D (2000) SAX. http://www.saxproject.org/

Web End =http://www.saxproject.org/ 32. Marian A, Simon J (2003) Projecting XML documents. In: Proceedings of the 29th international conference on very large data bases (VLDB), Berlin, Germany

33. Moerkotte G (2002) Incorporating XSL processing into database engines. In: Proceedings of the 28th international conference on very large data bases (VLDB), Hong Kong, China

34. Mukherjea, S: Discovering and analyzing world wide web collections. Knowl Inf Syst 6(2), 230241 (2004)

35. Nayak, R: Fast and effective clustering of XML data using structural information. Knowl Inf Syst 14 (2), 197215 (2008)

36. Paparizos S, Wu Y, Lakshmanan LVS, Jagadish HV (2004) Tree logical classes for efcient evaluation of XQuery. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD), Paris, France

37. Rys M (2001) Bringing the Internet to your database: Using SQL Server 2000 and XML to build loosely coupled systems. In: Proceedings of the 17th international conference on data engineering (ICDE), Heidelberg, Germany

38. Shanmugasundaram J, Kiernan J, Shekita E, Fan C, Funderburk J (2001) Querying XML views of relational data. In: Proceedings of 27th international conference on very large data bases (VLDB), Roma, Italy

39. Shanmugasundaram, J, Shekita, E, Barr, R, Carey, M, Lindsay, B, Pirahesh, H, Reinwald, B: Efciently publishing relational data as XML documents. VLDB J 10 (23), 133154 (2001)

40. University Trier (2005) Computer Science Bibliographie. http://dblp.uni-trier.de/

Web End =http://dblp.uni-trier.de/ 41. Wang L, Mulchandani M, Rundensteiner EA (2003) Updating XQuery views published over relational data: a round-trip case study. In: Proceedings of the rst international XML database symposium (XSym), Berlin, Germany

42. Wilfred, Ng, Lam, WY, Wood, PT, Levene, M: XCQ: a queriable XML compression system. Knowl Inf Syst 10 (4), 421452 (2006)

43. W3C (2004) Document Object Model (DOM) Level 3 Core Specication Version 1.0, W3C Recommendation. http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/

Web End =http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/

44. W3C (2001) Extensible Stylesheet Language (XSL). W3C Recommendation. http://www.w3.org/Style/XSL/

Web End =http://www.w3.org/Style/ http://www.w3.org/Style/XSL/

Web End =XSL/

123

390 S. Groppe et al.

45. W3C (1999) XML Path Language (XPath) Version 1.0. W3C Recommendation. http://www.w3.org/TR/xpath/

Web End =http://www.w3.org/TR/ http://www.w3.org/TR/xpath/

Web End =xpath/

Author biographies

Sven Groppe earned his diploma degree in Informatik (Computer Science) from the University of Paderborn in 2002 and his Doctor degree from the University of Paderborn in 2005. From 2005 to 2007, he worked as postdoc in the University of Innsbruck. He is currently working as postdoc in the University of Lbeck. In 2001/2002, he worked in the project B2B-ECOM, which dealt with distributed internet market places for the electrical industry. From 2002 to 2004, he worked in the project MEMPHIS in the area of premium services. From 2005 to 2006, he worked in the projects ASG and TripCom in the areas of Semantic Web Services. All projects were funded by the European Union. He was a member of the DAWG W3C Working Group, which developed SPARQL. He is currently the project leader of LUPOSDATE, a German national project funded by the DFG, which deals with Semantic Web data streams.

Jinghua Groppe earned her Bachelor degree in Computer Science and Applications from the Beijing Polytechnic University in 1989 and her Masters degree in Computer Science from the University of Amsterdam in 2001. She worked as Software Engineer in the Chinese Academy of Launch Vehicle Technology/China Aerospace Corporation from 1989 to 1999. She was Scientic Employee in the Department of Computer Science/University of Paderborn from 2001 to 2005 and in the Institute of Computer Science/University of Innsbruck from 2005 to 2006. She is currently a scientic employee in the Univeristy of Lbeck, Germany, and has recently submitted her Doctorate thesis. She worked in the projects EUQOS, UBISEC, EColleg, VHE and ASG. All projects were funded by the European Union. She is currently working in the project LUPOSDATE, a German national project funded by the DFG, which deals with Semantic Web data streams.

123

Optimizing the execution of XSLT stylesheets for querying transformed XML data 391

Stefan Bttcher is a Professor of Computer Science at the University of Paderborn. His research areas cover query optimization and compression in XML databases, transactions in mobile ad-hoc networks, security and privacy. Before he joined University of Paderborn, he worked for several years at the German research labs of IBM and Daimler Benz.

Thomas Wycisk was a student at the University of Paderborn and has developed an early version of the prototype for the XSLT optimization approach as part of his student research project.

Le Gruenwald is the Presidential and David W. Franke Professor and Director of the School of Computer Science at The University of Oklahoma and a Program Director of Information Integration and Informatics at National Science Foundation. She received her Ph.D. in Computer Science from Southern Methodist University in 1990. She was a Software Engineer at White River Technologies, a Lecturer in the Computer Science and Engineering Department at Southern Methodist University, and a Member of Technical Staff in the Database Management Group at the Advanced Switching Laboratory of NEC, America. Her major research interests include Database Management, Information Privacy and Security, Data Mining and Bioinformatics. She has published more than 150 technical articles in journals, books and conference proceedings. She is a member of ACM, SIGMOD, and IEEE Computer Society.

123

Word count: 24129

Show less

Springer-Verlag London Limited 2009

Optimizing the execution of XSLT stylesheets for querying transformed XML data

Content area

Abstract

Full text