DB2 pureXML: Tackling technology issues in a hybrid relational and XML database engine

Innovation Matters


IBM Research set out to improve DB2 pureXML by focusing on three issues: query performance, simplification of XML indexing, and search and discovery in XML content repositories.

XML has emerged as the predominant mechanism for representing and exchanging structured and semi-structured information across the Internet, within an intranet and between applications. Key benefits of XML are its vendor and platform independence and its high flexibility. Within the enterprise, XML represents actual business artifacts, such as derivatives contracts, mortgages, and legislative and legal documents. XML-based RSS and Atom are also used extensively for publishing and subscribing to Web feeds. These activities frequently require extensive search and discovery capabilities. While XML’s original intent was data exchange, an increasing amount of XML is designed to be stored. To address the demand for persistent storage, several XML management systems have been developed. Over the last couple of years, IBM Research has partnered with IBM Software Group (SWG) to manage and query XML data in DB2.


Released in September 2006, DB2 pureXML is a hybrid relational and XML database engine. Using the XML data type introduced by SQL/XML, it provides native XML storage, indexing, navigation and query processing through SQL/XML and XQuery. See Figure 1 for a view of the overall system architecture.


DB2 pureXML -- the first database management system (DBMS) to provide native XML storage -- stores XML data in columns of relational tables as instances of the XQuery data model. By storing a binary representation of type-annotated trees, DB2 pureXML avoids repeated parsing and validation of documents. It also provides selective value indexes on XML, specified by XPath expressions. These path expressions can contain wildcards, descendant axis navigation and kind tests, and the resulting indexes are used to evaluate value predicates over XML columns. DB2 pureXML is the only major DBMS that provides selective indexing over XML data.
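To make the idea of a selective value index concrete, here is a minimal Python sketch. It is not DB2 code and does not reflect how DB2 pureXML stores or maintains its indexes. It assumes a toy in-memory collection and uses the limited path syntax of the standard library's `xml.etree.ElementTree`. The function name `build_value_index` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_value_index(docs, pattern):
    """Map each value selected by `pattern` to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Only nodes reached by the pattern are indexed; everything else is skipped.
        for node in root.findall(pattern):
            if node.text and node.text.strip():
                index[node.text.strip()].add(doc_id)
    return index

# Hypothetical documents with two different shapes in one collection.
docs = {
    1: "<order><customer><name>Ann</name></customer><total>40</total></order>",
    2: "<order><customer><name>Bob</name></customer><total>75</total></order>",
    3: "<invoice><billing><name>Ann</name></billing></invoice>",
}

# Descendant navigation: index name elements at any depth.
name_index = build_value_index(docs, ".//name")

# A narrower pattern: index only names that sit under customer.
customer_index = build_value_index(docs, "./customer/name")

print(sorted(name_index["Ann"]))
print(sorted(customer_index["Ann"]))
```

The narrower pattern yields a smaller index that answers fewer predicates, which is the trade-off a selective index lets the user control.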

This database management system provides a SQL/XML and XQuery interface. These two languages are composable: XQuery can be invoked from SQL and vice versa. DB2 pureXML uses a single integrated query compiler for SQL/XML and XQuery. There is no translation from XQuery to SQL. After parsing, SQL/XML and XQuery queries are mapped into a unified internal representation and optimized by the hybrid query compiler.

The DB2 pureXML query evaluation run-time contains three major components for XML query processing:
    • XML navigation
    • XML index runtime
    • XQuery function library

Several relational run-time operators also have been extended to deal with XML data. The most important distinguishing property of XML data is its flexible nature. This has a major impact on the query evaluation run-time because it requires dynamic dispatch of functions at run-time. DB2 pureXML provides a new XML navigation engine, which evaluates path expressions over the native store by traversing the parent-child relationships in XML storage. It returns node references and atomic values to be processed further by the query run-time. Almost all XML DBMSs execute XPath expressions by modeling and evaluating one step at a time. By contrast, DB2 pureXML’s navigation engine can execute multiple complex XPath expressions concurrently and holistically.

Delving into three technology issues
After the first release of DB2 pureXML, IBM Research began concentrating on three important problems:
    • Performance improvements
    • Simplification of XML indexing
    • Search and discovery of XML content repositories

Bundling multiple XPath expressions into a single operator
To enhance the performance of XQuery and SQL/XML queries, the data management research team developed a new optimization strategy, which bundles multiple XPath expressions into a single operator.

We found that in most cases, combining multiple expressions for simultaneous execution significantly improves query performance. Blindly combining all XPath expressions in a query and reducing the search space, however, carries the risk of eliminating the optimal query execution plan from consideration. We addressed this problem by using a new optimization algorithm that identifies the most beneficial set of XPath expressions to be evaluated concurrently, and optimizes these expressions globally in the execution plan as well as locally within each navigation operator.
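The benefit of evaluating several path expressions together can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the DB2 navigation operator or the optimization algorithm described above. Paths are restricted to child steps from the document root, the cost measure is simply the number of elements examined, and the step that chooses which expressions to bundle is not modeled. The function name `evaluate_bundled` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def evaluate_bundled(root, paths):
    """Evaluate several child-axis paths in one traversal.

    Each path is a non-empty tuple of element names starting at the root.
    Returns (results, visited): results maps each path to the text of the
    elements it selects; visited counts the elements examined.
    """
    results = {path: [] for path in paths}
    visited = 0
    # Stack entries: (element, depth, paths whose first `depth` steps matched).
    stack = [(root, 0, list(paths))]
    while stack:
        node, depth, live = stack.pop()
        visited += 1
        still_live = [p for p in live if p[depth] == node.tag]
        for p in still_live:
            if len(p) == depth + 1:
                results[p].append(node.text)
        deeper = [p for p in still_live if len(p) > depth + 1]
        if deeper:
            # Push children reversed so they are popped in document order.
            for child in reversed(list(node)):
                stack.append((child, depth + 1, deeper))
    return results, visited

doc = ET.fromstring(
    "<order>"
    "<customer><name>Ann</name><city>Oslo</city></customer>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>B7</sku><qty>1</qty></item>"
    "</order>"
)
paths = [
    ("order", "customer", "name"),
    ("order", "item", "sku"),
    ("order", "item", "qty"),
]

# One operator for all three paths.
bundled_results, bundled_visits = evaluate_bundled(doc, paths)

# One operator per path, each starting again from the root.
separate_visits = sum(evaluate_bundled(doc, [p])[1] for p in paths)

print(bundled_results)
print(bundled_visits, separate_visits)
```

In this example the bundled traversal examines each element once, while the separate evaluations revisit the shared prefix of the tree for every path.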

Providing efficient XML indexes: A challenging task
A second problem was the simplification of indexing for XML querying.

An XML column in DB2 pureXML can store documents conforming to many different schemas, resulting in very heterogeneous document collections. In this scenario, providing efficient value indexes, which are specified by XPath expressions, is a challenging task.

On the one hand, indexing every value in every XML document may be prohibitively expensive due to storage and update maintenance costs. On the other hand, enumerating specific paths to be indexed may be cumbersome and challenging for users. As new document types are added to the collection, new indexes must be created and maintained.

Moreover, we observed that XML documents have characteristics of both semi-structured data and unstructured text, requiring full-text indexes for IR-style keyword search queries.

We investigated whether a full-text index, with some degree of XML awareness and XPath support, can be used to support XQuery predicates, as well as full-text search queries. The goal is to use the full-text index as a fast and approximate filter so that the user's original XQuery can be executed over only the surviving documents using a full XQuery processor. The results of our initial experiments are promising: They demonstrate the effectiveness of the full-text index as a viable alternative to regular pattern-based XML indexes.
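The filter-then-verify idea can be sketched in a few lines of Python. This is an illustration only and not the index used in the experiments. It assumes a plain inverted index over element names and text words, and it uses a hand-written check as a stand-in for a full XQuery processor. All function names and the sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def tokens(xml_text):
    """Yield lowercase tokens for the element names and text words of a document."""
    for elem in ET.fromstring(xml_text).iter():
        yield elem.tag.lower()
        for word in re.findall(r"\w+", elem.text or ""):
            yield word.lower()

def build_text_index(docs):
    """Inverted index: token -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for tok in tokens(xml_text):
            index[tok].add(doc_id)
    return index

def candidates(index, required, all_ids):
    """Approximate filter: documents that contain every required token."""
    result = set(all_ids)
    for tok in required:
        result &= index.get(tok.lower(), set())
    return result

def exact_match(xml_text):
    """Stand-in for the full query /order/customer[name = 'Ann']."""
    root = ET.fromstring(xml_text)
    return root.tag == "order" and any(
        n.text == "Ann" for n in root.findall("./customer/name")
    )

docs = {
    1: "<order><customer><name>Ann</name></customer></order>",
    2: "<order><customer><name>Bob</name></customer><note>call Ann</note></order>",
    3: "<invoice><billing><name>Ann</name></billing></invoice>",
    4: "<order><customer><name>Cy</name></customer></order>",
}

index = build_text_index(docs)
# The filter may keep false positives but must not drop a true answer.
survivors = candidates(index, ["order", "customer", "name", "ann"], docs)
# The exact query runs only over the surviving documents.
answers = sorted(d for d in survivors if exact_match(docs[d]))

print(sorted(survivors), answers)
```

Document 2 survives the filter because it contains all the tokens, even though its structure does not satisfy the query. The exact evaluation step removes it.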

Searching XML content repositories with thousands of different XML schemas
Finally, we addressed the search and discovery of XML content repositories, which contain XML documents conforming to thousands of different XML schemas. In repositories such as Google Base, users are creating their own schemas and contributing content.

These new repositories typically have a high degree of heterogeneity. Users have two choices for querying such collections: keyword search or XQuery. On the one hand, keyword search is too broad and may return too many irrelevant documents. On the other hand, XQuery/XPath is too strict because the user needs to know all the different XML schemas to formulate the correct XPath expressions. Users may not know the exact schema needed to specify the correct query, or they may be interested in too many different schemas.

Formulating XQuery queries in such an environment is challenging. We are currently exploring meaningful ways to query such XML content repositories by developing new techniques for performing semantic searches. The aim is to let users query XML collections with whatever knowledge of the collection they possess.


Related links
Celebrating ten years of XML (IBM Systems Journal 45(2))
IBM publications & software (IBM)

Innovator's corner  

Fatma Ozcan, Researcher
What's the potential for the work you are doing?
DB2 pureXML provides the middleware necessary for the next generation of SOA applications. It also has the potential to be the infrastructure for emerging Web 2.0 applications, such as mashups.

What is the most interesting part of your research?
There are vast volumes of data everywhere. There is not a single day that we do not utilize a DBMS -- at the supermarket, at the ATM, etc. Information management lets people access, analyze and understand such data, which would otherwise be almost impossible to do.

Who or what inspired you to go into this field?
My first database course as an undergraduate sparked my interest in studying data management.

What is your favorite invention of all time?
The airplane. It connects the world, enabling people to study and work thousands of miles away from home. It allows people to travel and explore the world, gaining invaluable understanding and appreciation for the world's diversity and people.

Research team  

Andrey Balmin, Kevin Beyer, Don Chamberlin, Yuan-Chi Chang, Bobbie Cochrane, Latha Colby, Vanja Josifovski, Quanzhong Li, Lipyeow Lim, Guy Lohman, Eric Louie, George Mihaila, Fatma Ozcan, Hamid Pirahesh, Berthold Reinwald, Bob Schloss, Dave Simmen, Ashutosh Singh, Min Wang, Chun Zhang and Zografoula Vagena