The idea of a database designed explicitly to store XML has been around for a few years now. Back in
1999, Software AG made the first large-scale attempt to build such a database with Tamino. Software
AG called Tamino a “native XML database” — a database that stores all components of the XML intact.
Since the release of Tamino, many other native XML databases have come to market. The newest one is Berkeley DB XML from Sleepycat Software, makers of the ubiquitous Berkeley DB. While Berkeley DB XML is a new, native XML database, it uses Berkeley DB as its foundation. That means Berkeley DB XML comes out of the box with robust database features that have been tested and proven in a great many real-world applications. If your application uses XML extensively, Berkeley DB XML is a perfect solution for persisting your data.
Like Berkeley DB, Berkeley DB XML is released under the Sleepycat Open Source license, so it can be used without charge to build Open Source applications. Also like Berkeley DB, and unlike most other native XML databases, Berkeley DB XML is an embedded database, meaning that it’s linked directly to your application — no server to talk to, no wires, no mapping — just XML documents in and XML documents out.
XML Databases
Even though native XML databases have been around for about five years, they’re still a novelty to many developers. In general, a native XML database:
* Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models are the XPath data model, the XML Infoset, and the models implied by the document object model (DOM) and the events in SAX 1.0.
* Has an XML document as its fundamental unit of (logical) storage, just as a relational database uses a row in a table as its fundamental unit of (logical) storage.
* Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files.
In other words, a native XML database is specialized for storing XML data, and stores all components of the XML model intact. XML documents go in and XML documents come out, and a native XML database may not actually be a standalone database at all.
Native XML databases don’t really represent a new low-level database model, and they aren’t intended to replace existing databases (like Oracle or MySQL). Instead, they’re simply a tool that provides robust storage and manipulation of XML documents.
Where to use Berkeley DB XML
So, where can a native XML database like Berkeley DB XML be used? Just about any place. That’s the simple answer. After all, Berkeley DB XML (BDBXML) is a general-purpose database library. And while you could build an accounting system with BDBXML, you probably wouldn’t — perhaps even shouldn’t.
There are two primary reasons to use BDBXML. The first and most obvious reason is that your application makes extensive use of XML. If you’re already using XML heavily, using something like a relational database for persistence imposes a great deal of overhead. Admittedly, it’s difficult to quantify that claim, but the process of mapping XML data to and from other database types is generally slow and difficult to do correctly.
If you’re using XML extensively, developing with BDBXML will be faster, easier, and will lead to a more robust solution that will likely perform better. BDBXML doesn’t have to map XML to a relational database — it just stores the data natively as XML. Better yet, the XML that’s retrieved is exactly the same as what was persisted — a feature known as document round-tripping. When mapping, document round-tripping is virtually impossible to achieve. BDBXML handles it without a problem.
The second and less obvious reason to use BDBXML is if your application needs to manage what is usually called semi-structured data. Semi-structured data is all the data we have floating around that has some structure, but is either too loosely defined or too complex to put into a relational database schema. Examples of semi-structured data include word processing files, text files, note taking applications, and spreadsheets.
Fortunately, there’s now a trend to move much of this data into XML, so that XML tools such as native XML databases can be applied to manage the data. If you’re building a new application that generates large amounts of semi-structured data, using BDBXML for persistence can be a big advantage. It can bring many of the advantages of large relational databases (such as transactions, concurrent access, and indexed queries) to systems that require a less structured persistence mechanism.
Programming with DB XML
Since Berkeley DB XML is an embedded database and not a stand-alone server, you have to write a program to be able to do anything useful with it. Before you can do that, you have to install Berkeley DB XML and set up your development environment.
The documentation for BDBXML provides clear instructions for installing the software, so we won’t cover that in detail here. However, the examples in this article use the BDBXML Java API, so make sure your installation has Java support enabled. (While we’re using Java for our examples, all the language APIs are similar. Therefore, even if Java isn’t your favorite language, the examples presented here will be very easy to translate to a language that’s more to your liking.) The Java bindings are enabled by passing the --enable-java option in the configure step when building both Berkeley DB and Berkeley DB XML. If you installed from a binary distribution, it will likely have the Java API already enabled.
Once you’ve installed Berkeley DB XML, you need to add a couple of things to the environment to be able to run the example programs.
1. On UNIX-like systems, you have to add the path to the Berkeley DB and Berkeley DB XML libraries to the LD_LIBRARY_PATH environment variable.
2. You have to add db.jar and dbxml.jar to your CLASSPATH. For example, if Berkeley DB and Berkeley DB XML are installed in the default locations, settings similar to the following should work.
LD_LIBRARY_PATH=/usr/local/BerkeleyDB.4.1/lib/:/usr/local/BerkeleyDBXML.1.1/lib
CLASSPATH=.:/usr/local/BerkeleyDB.4.1/lib/db.jar:/usr/local/BerkeleyDBXML.1.1/lib/dbxml.jar
export LD_LIBRARY_PATH CLASSPATH
On Mac OS X, you should use DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH.
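Before running the listings, it can help to confirm that the two JAR files are actually visible to Java. The following is a minimal sketch, assuming Java 7 or later; ClasspathCheck and isOnClasspath() are hypothetical names, and the two class names checked in main() are taken from the imports and calls used in this article’s listings. It only tests class visibility and does not verify that the native libraries can be loaded.

```java
// ClasspathCheck is a hypothetical helper, not part of Berkeley DB XML.
class ClasspathCheck {

    // Returns true if the named class is visible on the classpath.
    // Passing initialize=false locates the class without running its
    // static initializers (which might try to load native libraries).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names as used in the listings of this article.
        String[] required = {
            "com.sleepycat.db.Db",
            "com.sleepycat.dbxml.XmlContainer"
        };
        for (String name : required) {
            System.out.println(name + ": "
                    + (isOnClasspath(name) ? "found" : "MISSING (check CLASSPATH)"));
        }
    }
}
```

If either class is reported as missing, the CLASSPATH setting is the first thing to revisit.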
A Simple Example
Now that your environment is set, let’s dive in and see how things work. To start, let’s look at a very simple program that creates an empty database and adds a document to it. While plain, this first example shows the basics of working with the Berkeley DB XML API. The code is shown in Listing One.
Listing One: Add a document to the database
import com.sleepycat.db.*;
import com.sleepycat.dbxml.*;

public class AddDocument {
    public static void main(String[] args) throws Exception {
        // The XML to store; inner quotes must be escaped in Java
        String document = "<?xml version=\"1.0\"?>"
            + "<example>Just a toy</example>";

        // Create and open the container (a single file on disk)
        XmlContainer db = new XmlContainer(null, "example.dbxml", 0);
        db.open(null, Db.DB_CREATE, 0);

        // Wrap the XML in an XmlDocument and store it
        XmlDocument doc = new XmlDocument();
        doc.setContent(document);
        int id = db.putDocument(null, doc, null, 0);
        System.out.println("Document inserted as ID: " + id);

        db.close(0);
    }
}
To begin, the code creates an XmlContainer with:
XmlContainer db = new XmlContainer(null, "example.dbxml", 0);
XmlContainer represents a collection of XML documents and is stored within a single file in the file system. Here, the file example.dbxml will be created in the current working directory at the time that the program is executed. (Using the .dbxml file extension is the convention when naming Berkeley DB XML databases.)
Once you have an XmlContainer instance, you have to open it to use it:
db.open(null, Db.DB_CREATE, 0);
The Db.DB_CREATE flag creates the container if it doesn’t exist. It’s generally safe to always pass Db.DB_CREATE when opening a container.
Now that you’ve created a container, you can add a document to it:
XmlDocument doc = new XmlDocument();
doc.setContent(document);
int id = db.putDocument(null, doc, null, 0);
To insert a document, wrap it in an XmlDocument instance and put it into the container. The internal identifier that uniquely identifies the document is returned. You can use that identifier later to retrieve, update or delete the document.
Finally, you close the container with db.close(0);. You must always explicitly close containers; otherwise, your database can become corrupted easily. This can be particularly problematic as you start to use advanced features of the database, such as transaction management.
Working with Databases
The previous example demonstrated how easy the basic API is to use. Next, let’s convert the previous example into a class we can reuse. The new class, Database, is shown in Listing Two.
Listing Two: A class that represents a database
import com.sleepycat.db.*;
import com.sleepycat.dbxml.*;
public class Database {
XmlContainer db = null;
public Database(String dbPath) throws
Exception {
db = new XmlContainer(null, dbPath, 0);
int flags = Db.DB_CREATE;
db.open(null, flags, 0);
}
public int insertDocument(String docString)
throws Exception {
XmlDocument doc = new XmlDocument();
doc.setContent(docString);
return db.putDocument(null, doc, null, 0);
}
public void close() throws Exception {
db.close(0);
}
}
|
While the code’s not that exciting, it does make the previous code a little more reusable.
Now let’s add some additional BDBXML operations. Each method shown below should be added to the Database class in Listing Two.
First, let’s add a method to retrieve an existing document from the database:
public XmlDocument getDocument(int id) throws Exception {
    return db.getDocument(null, id, 0);
}
Here, XmlContainer.getDocument() (remember that db is an XmlContainer) retrieves the document. getDocument() takes three parameters, but only one matters here: the id of the document, which was returned earlier by XmlContainer.putDocument() (wrapped by Database.insertDocument()). The other two parameters are the transaction ID and a flags parameter that configures how the retrieval behaves with respect to open transactions and concurrent access. Since the examples here are simple and the documents are only accessed by one program at a time, those two can be safely ignored. We’ll see later how to retrieve a document without knowing its id.
Now add a method to delete a document:
public void deleteDocument(int id) throws Exception {
    XmlDocument document = getDocument(id);
    db.deleteDocument(null, document, null, 0);
}
The method XmlContainer.deleteDocument() takes an XmlDocument instance rather than a document id as you might expect. This means you must first retrieve a document before you can delete it.
As with XmlContainer.getDocument(), you can ignore the first and last parameters, and can also ignore the third parameter, which is an optional XmlUpdateContext that can be used to cache reusable objects when performing multiple deletions.
Next, let’s add a method to update an existing document:
public void updateDocument(int id, String content) throws Exception {
    XmlDocument document = getDocument(id);
    document.setContent(content);
    db.updateDocument(null, document, null);
}
To update a document, you first retrieve its XmlDocument instance and then replace the content before passing it to XmlContainer.updateDocument(). You always have to replace the entire document when doing an update. A future Berkeley DB XML version will add the ability to update documents with greater granularity.
Now that you can add, delete, and update documents, you can finally get to the brain candy: searching documents. For search, let’s add two methods: the first method performs the query, and the second method wraps the results in a new document.
Here’s the query method:
public XmlResults xpathQuery(String query) throws Exception {
    XmlQueryContext context = new XmlQueryContext(
        XmlQueryContext.ResultValues, XmlQueryContext.Eager);
    return db.queryWithXPath(null, query, context, 0);
}
An XmlQueryContext instance is used to specify how you want the query to be processed. The first parameter specifies how much of the document you want to retrieve. Here, XmlQueryContext.ResultValues retrieves just the matching nodes from the document, rather than the entire matched document. (Other options include XmlQueryContext.ResultDocuments to return the entire matched document, and XmlQueryContext.ResultDocumentAndValues to return both the document and the matching nodes.)
The second parameter to the XmlQueryContext constructor specifies how the query results should be read from the database. There are two options for this setting: XmlQueryContext.Eager reads all results into memory at once, and XmlQueryContext.Lazy reads individual results as needed. Eager evaluation is used here, because the result values are just going to be wrapped into a new document.
With XmlQueryContext configured, you can run the query via XmlContainer.queryWithXPath(). Again, you can ignore the first and last parameters. Your XPath query string is passed as the second parameter and the XmlQueryContext instance created previously is passed as the third. The return value is an XmlResults instance that contains a list of the results.
Here’s the second search method. It executes an XPath query against the database and wraps the results in a new document.
public String xpathQueryAsString(String query) throws Exception {
    XmlResults results = xpathQuery(query);
    StringBuffer buffer = new StringBuffer();
    buffer.append("<results>");
    XmlValue value;
    // next() returns null once all results have been read
    while ((value = results.next(null)) != null) {
        buffer.append(value.asString(null));
        buffer.append("\n");
    }
    buffer.append("</results>");
    return buffer.toString();
}
This method is primarily for convenience, but it’ll come in handy shortly.
Putting the Database to Work
Now it’s time to put Database to work. First, you need a tool to be able to execute the Database class. The tool is shown in Listing Three.
Listing Three: A Java command-line tool based on the Database class
import java.io.*;
import com.sleepycat.dbxml.*;

public class Tool {
    public static void main(String[] args) throws Exception {
        String database = args[0];
        String action = args[1];
        String param = args[2];
        Database db = new Database(database);

        if (action.equalsIgnoreCase("add")) {
            int id = db.insertDocument(readFile(param));
            System.out.println("Inserted document " + param + " as id: " + id);
        }
        else if (action.equalsIgnoreCase("delete")) {
            int key = Integer.parseInt(param);
            db.deleteDocument(key);
            System.out.println("Document id " + key + " deleted.");
        }
        else if (action.equalsIgnoreCase("get")) {
            int key = Integer.parseInt(param);
            XmlDocument document = db.getDocument(key);
            System.out.println(document.getContentAsString());
        }
        else if (action.equalsIgnoreCase("update")) {
            int key = Integer.parseInt(param);
            String newContent = readFile(args[3]);
            db.updateDocument(key, newContent);
            System.out.println("Document id " + key + " updated.");
        }
        else if (action.equalsIgnoreCase("query")) {
            String result = db.xpathQueryAsString(param);
            System.out.println(result);
        }
        db.close();
    }

    // Reads a whole text file into a String
    public static String readFile(String param) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(param));
        StringBuffer buffer = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            // readLine() strips the line terminator, so put one back
            buffer.append(line).append('\n');
        }
        in.close();
        return buffer.toString();
    }
}
There’s nothing particularly notable about the program — it just uses the Database class to build a simple command line tool to manipulate a Berkeley DB XML database. For the sake of brevity, all error checking has been omitted.
The usage of the program can be summarized as:
java Tool database (add|update|delete|get|query) (id|filename|xpath) [filename]
For example, to add the file sample.xml to the database you’d run:
$ java Tool sample.dbxml add sample.xml
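Since Listing Three omits error checking, a malformed command line ends in an exception. The following is a minimal sketch of the kind of argument check that could run at the top of main(); ToolArgs and validate() are hypothetical names, and the check covers only the five actions in the usage summary above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// ToolArgs is a hypothetical helper for the Tool class in Listing Three.
class ToolArgs {

    static final List<String> ACTIONS =
            Arrays.asList("add", "update", "delete", "get", "query");

    // Returns null when the arguments look usable, otherwise an error message.
    static String validate(String[] args) {
        if (args.length < 3) {
            return "usage: java Tool database (add|update|delete|get|query) "
                    + "(id|filename|xpath) [filename]";
        }
        // Tool compares actions case-insensitively, so normalize here too.
        String action = args[1].toLowerCase(Locale.ROOT);
        if (!ACTIONS.contains(action)) {
            return "unknown action: " + args[1];
        }
        if (action.equals("update") && args.length < 4) {
            return "update needs an id and a filename";
        }
        // delete, get, and update all take a numeric document id.
        if (action.equals("delete") || action.equals("get") || action.equals("update")) {
            try {
                Integer.parseInt(args[2]);
            } catch (NumberFormatException e) {
                return "id must be an integer: " + args[2];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        System.out.println(problem == null ? "arguments look fine" : problem);
    }
}
```

Returning a message rather than throwing keeps the decision about how to report the problem with the caller.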
Now that you have a database program to play with, use it to build a simple address book application. To start, decide what the XML you want to use should look like. There are numerous ways to structure the XML, but something like this is probably adequate:
<?xml version="1.0"?>
<address>
  <name>
    <first>John</first>
    <last>Smith</last>
  </name>
  <phone type="home">290-203-2003</phone>
  <phone type="work">290-392-2193</phone>
  <street>1201 N Main St.</street>
  <city>Ringdale</city>
  <state>Texas</state>
  <zip>87292</zip>
</address>
Since Berkeley DB XML is designed around storing large document collections, you can put each address book entry into a separate document and then store that document in the database. This is in contrast to what you often see when people try to use XML as a data storage format.
For instance, it would be common to see something like this:
<addressbook>
<address>…</address>
<address>…</address>
<address>…</address>
…
</addressbook>
This treats a document as if it were a database. In Berkeley DB XML, that’s exactly what you do not want to do. Instead, you should use an XmlContainer and place each individual address entry into that container.
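If your data already lives in one large addressbook file, it can be split into one XML string per entry before loading. The following is a rough sketch using only the JDK’s DOM and Transformer APIs; AddressBookSplitter and split() are hypothetical names, and the sketch assumes the address elements are direct children of the root element. Each returned string could then be handed to Database.insertDocument().

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// AddressBookSplitter is a hypothetical helper, not part of Berkeley DB XML.
class AddressBookSplitter {

    // Returns one serialized <address> element per entry in the address book.
    static List<String> split(String addressBookXml) throws Exception {
        Document book = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(addressBookXml)));

        // An identity transform serializes a DOM node back to text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> entries = new ArrayList<>();
        // Only direct children of the root named "address" are treated as entries.
        for (Node child = book.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && "address".equals(child.getNodeName())) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(out));
                entries.add(out.toString());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        String book = "<addressbook>"
                + "<address><name><first>John</first><last>Smith</last></name></address>"
                + "<address><name><first>Mary</first><last>Smith</last></name></address>"
                + "</addressbook>";
        for (String entry : split(book)) {
            System.out.println(entry);
        }
    }
}
```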
So, make up a few address book entries and add them to the database. It’s up to you what you put in these files, but one entry should be the John Smith example shown above and all of the address book entries should have the same XML structure.
Assuming you’ve created three XML files — address1.xml, address2.xml, and address3.xml — for three separate address book entries, these commands would add the entries to the database:
$ java Tool address.dbxml add address1.xml
$ java Tool address.dbxml add address2.xml
$ java Tool address.dbxml add address3.xml
You can check to see if the data is there by running the following query:
$ java Tool address.dbxml query "/node()"
This query selects the root node of every document in the database. You should see all your address book entries in the results.
Next, try a few other XPath queries. The exact results of each command shown below depend on what content you’ve added to your database.
* Find the address book entries for everyone who lives in Texas:
$ java Tool address.dbxml query "/address[state = 'Texas']"
<results>
<address dbxml:id="1" xmlns:dbxml="http://www.sleepycat.com/2002/dbxml">
<name> <first>John</first> <last>Smith</last> </name>
<phone type="home">290-203-2003</phone>
<phone type="work">290-392-2193</phone>
<street>1201 N Main St.</street>
<city>Ringdale</city>
<state>Texas</state>
<zip>87292</zip>
</address>
<address dbxml:id="2" xmlns:dbxml="http://www.sleepycat.com/2002/dbxml">
<name><first>Mary</first><last>Smith</last></name>
<phone type="home">290-203-2003</phone>
<phone type="work">290-439-0392</phone>
<street>1201 N Main St.</street>
<city>Ringdale</city>
<state>Texas</state>
<zip>87292</zip>
</address>
</results>
* Find the first name of every person with the last name Smith:
$ java Tool address.dbxml query "/address/name/first[../last = 'Smith']"
<results>
<first>John</first>
<first>Mary</first>
</results>
* Find the home phone number for John Smith:
$ java Tool address.dbxml query "/address/phone[@type='home'][../name/first = 'John' and ../name/last = 'Smith']"
<results>
<phone type=”home”>290-203-2003</phone>
</results>
* Find the names of all people who live in area code 290:
$ java Tool address.dbxml query "/address/name[../phone[@type = 'home' and starts-with(., '290')]]"
<results>
<name><first>John</first><last>Smith</last></name>
<name><first>Mary</first><last>Smith</last></name>
</results>
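XPath expressions like these can also be tried out without a database. The following is a small sketch that uses the JDK’s javax.xml.xpath package to evaluate an expression against a single in-memory copy of the John Smith entry; XPathScratchpad and matches() are hypothetical names. Note the difference in scope: this evaluates against one document, whereas a container query is evaluated against every document in the container.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// XPathScratchpad is a hypothetical helper, not part of Berkeley DB XML.
class XPathScratchpad {

    // The John Smith entry from this article, written on one line.
    static final String JOHN =
        "<address><name><first>John</first><last>Smith</last></name>"
        + "<phone type=\"home\">290-203-2003</phone>"
        + "<phone type=\"work\">290-392-2193</phone>"
        + "<street>1201 N Main St.</street><city>Ringdale</city>"
        + "<state>Texas</state><zip>87292</zip></address>";

    // Evaluates an XPath 1.0 expression and returns the text of each matching node.
    static List<String> matches(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(matches(JOHN, "/address/name/first[../last = 'Smith']"));
        System.out.println(matches(JOHN,
            "/address/phone[@type='home'][../name/first = 'John' and ../name/last = 'Smith']"));
        // No match: this entry is in Texas, so the list is empty.
        System.out.println(matches(JOHN, "/address[state = 'Ohio']"));
    }
}
```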
XPath is a very powerful language and it’s a good idea to play around to see what else can be done with it. However, a few limitations are worth noting when considering XPath as a database query language.
* There’s no way to sort the query results.
* XPath has no concept of grouping.
* The XPath data model is limited to just numbers, strings and XML constructs like nodes and node lists. This limitation usually becomes apparent when you want to query on dates.
* There’s no way to perform a query that joins between documents. In XPath, you can have relative references to nodes within the same document, but there’s no way to reference any other documents that are participating in the query.
The standard XML solution to all of these problems is XPath 2.0, in combination with XQuery. However, those standards are still in development and are in no way certain to be successful.
A solution that works today is XSLT. While it can’t solve the date problem, it helps tremendously with the rest. Unfortunately, using XSLT is more than we can cover in this article.
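Sorting, at least, can also be handled in plain application code once the results are in hand. The following is a minimal sketch, assuming Java 8 or later and assuming each query result has already been converted to an XML string (for instance via XmlValue.asString(), as in xpathQueryAsString()). ResultSorter and sortByKey() are hypothetical names, and the sort key is extracted with the JDK’s own XPath engine rather than with Berkeley DB XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// ResultSorter is a hypothetical helper, not part of Berkeley DB XML.
class ResultSorter {

    // Sorts XML fragments by the string value of keyExpression in each fragment.
    static List<String> sortByKey(List<String> xmlResults, String keyExpression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Compute each key once, keeping it paired with its fragment.
        List<String[]> keyed = new ArrayList<>();
        for (String xml : xmlResults) {
            String key = xpath.evaluate(keyExpression, new InputSource(new StringReader(xml)));
            keyed.add(new String[] { key, xml });
        }
        keyed.sort(Comparator.comparing((String[] pair) -> pair[0]));

        List<String> sorted = new ArrayList<>();
        for (String[] pair : keyed) {
            sorted.add(pair[1]);
        }
        return sorted;
    }

    public static void main(String[] args) throws Exception {
        List<String> results = new ArrayList<>();
        results.add("<address><name><first>Mary</first><last>Smith</last></name></address>");
        results.add("<address><name><first>John</first><last>Smith</last></name></address>");
        for (String xml : sortByKey(results, "/address/name/first")) {
            System.out.println(xml);
        }
    }
}
```

This sorts keys as plain strings; numeric or locale-aware ordering would need a different Comparator.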
Indexing
So far, you’ve managed documents in the database and searched them. Now it’s time to take advantage of one of Berkeley DB XML’s best features: indexing XML data.
Here’s a method to add to Database to index the database:
public void addIndex(String nodeName, String strategy) throws Exception {
    db.addIndex(null, null, nodeName, strategy);
}
As a corollary, you also need to add another action to the command line program. Add the following code to the main() method:
else if (action.equalsIgnoreCase("index")) {
    db.addIndex(param, args[3]);
    System.out.println("Index created on: " + param);
}
You can now add indexes by running a command similar to the following:
$ java Tool address.dbxml index zip node-element-equality-string
Berkeley DB XML defines a number of different indexing strategies for an XmlContainer. The complete list of strategies can be found in the documentation. However, a few commonly used strategies include node-attribute-equality-number, node-element-equality-string, edge-element-presence-none, and node-element-substring-string.
The indexing strategy names can be generalized to (index entry type)-(node type)-(index type)-(value type). For instance, if you define a node-element-equality-string index on the node name city…
$ java Tool address.dbxml index city node-element-equality-string
… then the content of all element nodes with the name city is treated as a string and an index of those nodes is created to accelerate XPath equality searches. That’s a mouthful, which is why Berkeley DB XML uses a terser syntax.
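The four-part naming pattern can be made concrete with a tiny parser that splits a strategy name into its parts. This is only an illustration of the naming scheme described above; IndexStrategy and its field names are hypothetical and follow this article’s generalization rather than any Berkeley DB XML class.

```java
// IndexStrategy is a hypothetical illustration, not part of Berkeley DB XML.
class IndexStrategy {
    final String entryType; // e.g. "node" or "edge"
    final String nodeType;  // e.g. "element" or "attribute"
    final String indexType; // e.g. "equality", "presence", "substring"
    final String valueType; // e.g. "string", "number", "none"

    private IndexStrategy(String entryType, String nodeType,
                          String indexType, String valueType) {
        this.entryType = entryType;
        this.nodeType = nodeType;
        this.indexType = indexType;
        this.valueType = valueType;
    }

    // Splits a name such as "node-element-equality-string" into its four parts.
    static IndexStrategy parse(String name) {
        String[] parts = name.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                    "expected four hyphen-separated parts: " + name);
        }
        return new IndexStrategy(parts[0], parts[1], parts[2], parts[3]);
    }

    String describe() {
        return entryType + " index on " + nodeType + " nodes, "
                + indexType + " lookups, values treated as " + valueType;
    }

    public static void main(String[] args) {
        System.out.println(parse("node-element-equality-string").describe());
        System.out.println(parse("edge-element-presence-none").describe());
    }
}
```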
Understanding the difference between node indexes and edge indexes is probably the most difficult part of indexing in Berkeley DB XML. Edge indexes index the node in combination with all incoming edges to that node, while node indexes index just the node. Because of this, edge indexes are more specific than node indexes and are faster when used in the proper circumstances. For simple cases, you can just use node indexes and not worry about edge indexes.
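The distinction may be easier to see in a toy model. The sketch below is purely conceptual and is not how Berkeley DB XML implements its indexes: it builds two in-memory maps over a few documents, one keyed by element name alone (the node view) and one keyed by the parent name plus the element name (the edge view, here simplified to a single parent-child pair). NodeVsEdgeIndex and all of its members are hypothetical names, and the sample documents are made up.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// NodeVsEdgeIndex is a conceptual toy, not Berkeley DB XML's implementation.
class NodeVsEdgeIndex {
    // element name -> ids of documents containing that element
    final Map<String, Set<Integer>> nodeIndex = new HashMap<>();
    // "parent.child" -> ids of documents containing that parent-child pair
    final Map<String, Set<Integer>> edgeIndex = new HashMap<>();

    void add(int docId, String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        walk(docId, doc.getDocumentElement());
    }

    private void walk(int docId, Element element) {
        String name = element.getNodeName();
        nodeIndex.computeIfAbsent(name, k -> new TreeSet<>()).add(docId);

        // The root element has no parent element, so it gets no edge entry.
        Node parent = element.getParentNode();
        if (parent instanceof Element) {
            String edge = parent.getNodeName() + "." + name;
            edgeIndex.computeIfAbsent(edge, k -> new TreeSet<>()).add(docId);
        }
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk(docId, (Element) child);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        NodeVsEdgeIndex index = new NodeVsEdgeIndex();
        index.add(1, "<address><name>Smith</name></address>");
        index.add(2, "<company><name>Acme</name></company>");
        // The node view cannot tell the two kinds of name apart; the edge view can.
        System.out.println("node  name         -> " + index.nodeIndex.get("name"));
        System.out.println("edge  address.name -> " + index.edgeIndex.get("address.name"));
    }
}
```

In this toy model a lookup on the name node has to consider both documents, while a lookup on the address.name edge narrows straight to the one that matters, which is the sense in which edge indexes are more specific.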
In a database with three documents, an index of any kind really isn’t necessary. In fact, it will just make things slower. However, if you have thousands or millions of documents stored, indexes don’t just help, they’re essential.
For instance, on a test database containing 100,000 documents, a query of the form /address[city='Phoenix'] takes 43 seconds. Defining a node-element-equality-string index on the city element speeds the query to just 0.17 seconds.
Berzerk with XML
Hopefully, this whirlwind tour of the basics of Berkeley DB XML and native XML databases has been helpful. The sidebar “Berkeley DB XML Features” summarizes all of the product’s capabilities.
This article is too short to go into detail on all the features of Berkeley DB XML, so here’s a quick summary:
* Stores XML natively in the database.
* XML storage is schema independent.
* XPath 1.0 for XML queries.
* Indexes XML data to accelerate XPath queries.
* Allows association of additional meta-data with documents.
* Runs as a library embedded in the application.
* Provides C++, Java, Perl and Python programming APIs.
* Transactions, concurrent access and replication inherited from the Berkeley DB foundation.
* Runs on UNIX, Linux, Windows and Mac OS X.
* Up to 256 terabytes of data per container and 4 gigabytes per document.
* Easily mix data stored in Berkeley DB with data stored in Berkeley DB XML.
While 4 gigabytes is an internal limit on the size of documents that Berkeley DB XML can store, the practical limit is going to be much smaller.
Berkeley DB XML is designed to manage large quantities of smaller documents rather than a few larger ones. Exactly what “smaller” means will depend largely on the particular application and how it uses the XML.
For the author, Berkeley DB XML has proven to be an extremely rich and powerful tool for building XML-based applications. If your application makes extensive use of XML and needs a highly flexible storage container, Berkeley DB XML can’t be beat.
Kimbro Staken is an independent consultant, author, and open source developer specializing in technologies for XML data management. You can reach Kimbro at kstaken@xmldatabases.org. You can download the source code from this story at http://www.linux-mag.com/downloads/2004-02/bdbxml.