DOM-inating with XML

Using XML in your applications is just a matter of using the right API. Norm shows us how to get started.


In the July issue (http://www.linux-mag.com/2001-07/xmlintro_01.html), we discussed the basics of XML and described how to store information in XML documents. This month, we’ll explore the fundamentals of XML programming — how to get information out of XML documents. To get a feel for reading and manipulating XML documents, we’ll build two complete, simple applications — a word counter and a search-and-replace tool.

We’ll be using Java and Perl in our examples, but the same principles apply to C, Python, Tcl, or whatever your favorite language may be. See Resources, at the end of this article, for information about finding the XML libraries and APIs appropriate for your programming language.

XML Documents

At first glance, a simple XML document like the one shown in Listing One is just a bunch of tags and data. But there’s more to it than that. The tags provide structure and context for the data. By reading the XML, we can tell that 555-1234 is Jane Smith’s work phone number. Furthermore, we know that her phone number is only one part of her address record. If there were additional address records in the address book, we could easily find the names and numbers associated with each address.

Listing One: A Simple XML Document

<?xml version='1.0'?>
<?addressbook preferred-phone='work'?>
<addressbook>
<address>
<name>Jane Smith</name>
<phone type="work">555-1234</phone>
</address>
</addressbook>

Figure One: Structure of an XML document.

When you need to work with XML in a program, you’ll usually rely on an “XML Parser” to give your program access to the data in the XML file in a uniform and structured way. The parser’s job is simply to read in text from the XML file and make it easy for you to access the data and to navigate its structure. It does this by converting the data represented by the XML into a tree data structure. Each node in the tree represents a different type of data (element, attribute, processing instruction, etc.) in the original XML.

The tree represented by the XML in Listing One is illustrated in Figure One.
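To make the tree concrete, here is a minimal Java sketch that parses an address book document with JAXP and prints one line per node, labeled by node type. The class name TreeDump and its dump method are hypothetical, and the addressbook and address wrapper elements in the sample string are an assumption based on the description of the address record above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class TreeDump {
    // The address book as a string; the addressbook/address wrappers are assumed.
    static final String SAMPLE =
        "<?xml version='1.0'?>"
        + "<?addressbook preferred-phone='work'?>"
        + "<addressbook><address>"
        + "<name>Jane Smith</name>"
        + "<phone type=\"work\">555-1234</phone>"
        + "</address></addressbook>";

    /** Parses the XML text and returns one indented line per node. */
    static String dump(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            describe(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    private static void indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    private static void describe(Node node, int depth, StringBuilder out) {
        indent(depth, out);
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                out.append("document");
                break;
            case Node.ELEMENT_NODE:
                out.append("element ").append(node.getNodeName());
                break;
            case Node.TEXT_NODE:
                out.append("text ").append(node.getNodeValue());
                break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                out.append("pi ").append(node.getNodeName());
                break;
            default:
                out.append("other ").append(node.getNodeName());
        }
        out.append('\n');

        // Attributes hang off their element; they are not child nodes.
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                indent(depth + 1, out);
                out.append("attribute ").append(attr.getNodeName())
                   .append('=').append(attr.getNodeValue()).append('\n');
            }
        }

        // Recurse into the children, one level deeper.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            describe(c, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(dump(SAMPLE));
    }
}
```

Note that the XML declaration does not show up as a node, while the processing instruction appears as a child of the document node, alongside the root element.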

APIs for Handling XML

Rather than interface directly with the XML Parser, most languages provide one or more APIs that implement a level of abstraction between the parser and your code. The two most common APIs are the Simple API for XML (SAX) and the Document Object Model (DOM). The rest of this article explores both APIs using Perl and Java code to construct the two example applications.

SAX is an event-driven API. As the XML parser works through the document, it generates “events” as it encounters tags, attributes, data, and so on. You can register handlers that will be called when each of these events occurs. The handler receives information about what triggered the event.

The SAX2 API, released in May 2000, includes several sets of handlers. The most common one is the ContentHandler; it allows you to process the events you are usually most interested in: the start and end of elements, characters, etc.
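As a rough illustration of the event model, the following Java sketch records the ContentHandler events for a small document in the order the parser reports them. The class name EventLogger and its log method are hypothetical. The sketch assumes the default (non-namespace-aware) JAXP parser, so element names arrive in the qName argument.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several chunks.
        events.add("chars " + new String(ch, start, length));
    }

    /** Parses the XML text and returns the recorded events in order. */
    static List<String> log(String xml) {
        try {
            EventLogger handler = new EventLogger();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        for (String event : log("<name>Jane Smith</name>")) {
            System.out.println(event);
        }
    }
}
```

Events that the handler does not override, such as comments or processing instructions, are simply ignored by DefaultHandler.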

DOM is the more tree-oriented API. When using a DOM parser, what you receive from the parser is a Document node. Behind the Document node is a tree of nodes representing the logical structure of the document.

In order to use the information stored in a DOM, you must navigate around the tree and inspect the content of each node. Luckily, the DOM provides methods that let you find the children and siblings of an element, its attributes, parent, and ancestors. The basic navigation primitives provided by the DOM are sufficient to locate every node in the tree.
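The navigation primitives mentioned above might be used as in the following Java sketch, which finds a phone element and then looks at its parent, its preceding sibling, its attribute, and its text. The class name Navigate and the describePhone method are hypothetical, and the sample document passed in main is an assumption. It contains no whitespace between elements, because such whitespace would show up as extra text nodes among the siblings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class Navigate {
    /** Describes the first phone element in terms of its neighbors. */
    static String describePhone(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Look up a descendant by tag name, then move around from there.
            Element phone = (Element) root.getElementsByTagName("phone").item(0);
            Node parent = phone.getParentNode();
            Node previous = phone.getPreviousSibling();

            return parent.getNodeName() + " > " + phone.getNodeName()
                + " [type=" + phone.getAttribute("type") + "]"
                + " after " + previous.getNodeName()
                + " = " + phone.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describePhone(
            "<address><name>Jane Smith</name>"
            + "<phone type=\"work\">555-1234</phone></address>"));
    }
}
```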

Which Is Better, SAX or DOM?

As you might expect, there’s no easy answer. The APIs are different, so you’ll find that some problems are just easier to solve with one rather than with the other. Let’s look at a few of the important strengths and weaknesses of each API:


SAX:

  • Strength. It’s easy to stream. Event-driven APIs can begin processing a document as soon as the parser starts and require a more-or-less fixed amount of memory to process documents of arbitrary size.
  • Strength. You can select the events you want. Don’t register handlers if you’re not interested in comments or end tags.
  • Weakness. Only “preceding” nodes are available. When an event occurs, the only information that can be known comes from events that have already occurred. If an application needs or wants to “look forward,” at children, for example, it must perform its own buffering.

DOM:

  • Strength. The entire document is available. An application can navigate through the document at any time.
  • Weakness. The entire document must be in memory. While “lazy” and caching implementations are possible, most implementations just assume that there is enough memory to store the whole document.
  • Weakness. The DOM always contains all of the nodes in the document. If you’re only interested in specific kinds of nodes, you generally have to explicitly ignore the other kinds as you navigate through the document.

Example One: Word Counting

Our first example is a simple word counting application. Given an XML document, the task is to count all the words that are actual textual content, ignoring markup.

Using SAX in Java

SAX, as we’ve discussed, is an event-driven API. The first thing we have to do is set up a handler for the events in which we’re interested. For this application, we care about character content, which is part of the ContentHandler.

The easiest way to construct a new ContentHandler is to extend org.xml.sax.helpers.DefaultHandler. Our class, WCHandler, is shown in Listing Two.

Listing Two: The WCHandler Class in Java

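import java.util.StringTokenizer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WCHandler extends DefaultHandler {
    // Running total of the words seen so far
    public int wc = 0;

    public void characters(char[] ch, int start, int length)
        throws SAXException {

        // Use a StringTokenizer to count the words…
        String str = new String(ch, start, length);
        StringTokenizer st = new StringTokenizer(str);
        wc += st.countTokens();
    }
}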

The methods of the DefaultHandler ignore all events. Our class overrides the characters event to count the number of words in each string of text from the document.

Now that we have a handler to count words, we need to set up a SAX parser to use that handler. You can get the complete source for all of the example programs in this article from http://www.linux-mag.com/downloads/xmlprog/. In the interest of space, we’ll concentrate our explanation on just the interesting bits.

After some initial preamble, we establish a JAXP SAXParserFactory:

// Construct a SAX Parser using JAXP
SAXParserFactory factory = SAXParserFactory.newInstance();

JAXP is an established Java standard for processing XML and creating parsers of various sorts. You don’t have to use JAXP to use SAX; just construct a SAX Parser using whatever API you have available. Next, we instantiate our handler:

// Our handler will actually count the words
WCHandler handler = new WCHandler();

Finally, we construct a parser and parse using our handler:

try {
    SAXParser parser = factory.newSAXParser();
    parser.parse(xmlfile, handler);
} catch (Exception e) {
    // Fail ungracefully: show a stack trace and abort
    e.printStackTrace();
    System.exit(1);
}

Parsing an XML document may throw a number of different exceptions, ranging from file not found and I/O errors to XML well-formedness problems and configuration errors. These example programs will simply abort if one is received and show a stack trace. Real applications would certainly need to catch different flavors of exception and respond more gracefully.
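As a sketch of what more graceful handling could look like, the following Java class catches the main flavors of exception separately and turns each into a short message. The class name SafeParse, the check method, and the message wording are all hypothetical; a real application would decide for itself how to report or recover from each case.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class SafeParse {
    /** Parses the XML text and reports what, if anything, went wrong. */
    static String check(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new DefaultHandler());
            return "ok";
        } catch (SAXParseException e) {
            // Well-formedness problems carry a position in the document.
            return "not well-formed at line " + e.getLineNumber()
                + ": " + e.getMessage();
        } catch (SAXException e) {
            // Any other error raised by the parser or a handler.
            return "parser error: " + e.getMessage();
        } catch (IOException e) {
            // File not found, unreadable input, and similar problems.
            return "I/O error: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            // The factory could not build a parser as configured.
            return "configuration error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b></a>"));
    }
}
```

SAXParseException is a subclass of SAXException, so it has to be caught first.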

When the parser.parse() method is called, the parsing will begin, and each time text content is encountered, our character handler in WCHandler will be executed. When parsing finishes, we simply print out the word count that our handler recorded:

// Interrogate our handler to get the word count
System.out.println("There are "
    + handler.wc
    + " words in "
    + xmlfile);
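Putting the pieces together, a complete word counter might look like the sketch below. It is an assembled approximation, not the downloadable source. The outer class name WordCount and the count helpers are hypothetical, and the handler is nested so that the example fits in one file. One known limitation: a parser may deliver a run of text in several chunks, so a word that straddles a chunk boundary would be counted twice.

```java
import java.io.File;
import java.io.StringReader;
import java.util.StringTokenizer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical assembly of the fragments shown above.
final class WordCount {
    /** Counts the words in every string of text the parser reports. */
    static final class WCHandler extends DefaultHandler {
        int wc = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            String str = new String(ch, start, length);
            wc += new StringTokenizer(str).countTokens();
        }
    }

    /** Parses the given source and returns the number of words found. */
    static int count(InputSource source) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            WCHandler handler = new WCHandler();
            parser.parse(source, handler);
            return handler.wc;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    /** Convenience overload for XML held in a string. */
    static int count(String xml) {
        return count(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        String xmlfile = args[0];
        InputSource source =
            new InputSource(new File(xmlfile).toURI().toString());
        System.out.println("There are " + count(source)
            + " words in " + xmlfile);
    }
}
```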

Using DOM in Perl

Parsing with the DOM constructs a complete object model of the document. All we need to do then is walk over the tree, counting the words in each text node.

First, we create a parser using XML::DOM::Parser:

# Create a parser
my $parser = new XML::DOM::Parser (NoExpand => 0);

Next, we parse the source document. This constructs a DOM representation and returns it to us:

# Parse the XML document
my $xmldoc = $parser->parsefile($xmlfile);

XML documents may begin with some amount of prologue. We don’t care about that prologue, so we use getDocumentElement to get the Element node that represents the root of the document:

# Get a pointer to the root element of the document
my $root = $xmldoc->getDocumentElement();

The root node is passed to a simple recursive function that walks over all of the nodes in the tree:

# Now walk over the nodes, counting words
my $wc = &wordCount($root);

The heart of wordCount is a loop that examines each child of the current element. If the child is itself an element, we count all of the words inside it by calling wordCount recursively. If the child is a text node, we count all of the words it contains. Otherwise, we just ignore the node:

    # as long as there are more children…
    while ($child) {
        if ($child->getNodeType() == XML::DOM::ELEMENT_NODE) {
            # count the words in sub-elements
            $wc += &wordCount($child);
        } elsif ($child->getNodeType() == XML::DOM::TEXT_NODE) {
            # count the words in text nodes
            my $text = $child->getData();
            if ($text !~ /^\s*$/) {
                # split ' ' skips leading whitespace, so no empty first word
                my @words = split(' ', $text);
                $wc += scalar(@words);
            }
        } else {
            # ignore other node types
        }
        # move to the next child
        $child = $child->getNextSibling();
    }

Finally, we return the number of words counted and print that result.
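For comparison, the same recursive walk can be sketched in Java against the standard org.w3c.dom interfaces. This is an illustrative translation rather than code from the example programs, and the class name DomWordCount and the count helper are hypothetical. Like the Perl version, it ignores every node type other than elements and text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical Java counterpart of the Perl wordCount function.
final class DomWordCount {
    /** Counts the words in all text nodes below the given node. */
    static int wordCount(Node node) {
        int wc = 0;
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Count the words in sub-elements
                wc += wordCount(child);
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                // Trim first so the split produces no empty leading word
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    wc += text.split("\\s+").length;
                }
            }
            // Other node types are ignored
        }
        return wc;
    }

    /** Parses the XML text and counts the words under the root element. */
    static int count(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return wordCount(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("<doc><p>one two</p> <p>three</p></doc>"));
    }
}
```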

Example Two: Search and Replace

Two aspects of our word counting application made it really easy to write; first, it didn’t have to change anything — it was purely read-only; second, it ignored all the markup and just looked at the text nodes.

Our second example is a little more ambitious; we’re going to write a very simple search-and-replace application. We can’t use standard tools like sed on XML files because they might make changes to markup that wouldn’t be valid; changing “red” to “yellow” doesn’t mean we want to change <credit> to <cyellowit>.

Our application will take three arguments: the name of an XML file, the string to find, and the replacement string. It will also have one option — to specify whether or not string replacement should occur inside attribute values.

For simplicity, the modified document will simply be printed to stdout.

Using SAX in Perl

The goal of our application is to make changes to a document as we process the SAX events that occur while it is parsed. However, for our application to be of any use, it must do something with the modified content. In particular, we want to print the modified document.

How can we do this? One way would be to write a content handler that registered interest in all the SAX events and printed the content that it received. This would work, but it would not be very reusable. Instead, let’s write our content handler so that it can be chained together with other handlers. Instead of ignoring the events that it isn’t interested in, it will pass them on to some subsequent handler. It will also pass on the results of the modifications that it makes.

This is the idea behind a SAX Filter. The SAX2 API includes several classes and methods to make filters easier to use and write, but the PerlSAX module that we’re using doesn’t support those methods yet, so we’ll have to do it by hand.
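For readers working in Java, where the SAX2 filter classes are available, the same chaining idea can be sketched with org.xml.sax.helpers.XMLFilterImpl. This is an illustration of the filter pattern, not a port of the Perl program: the class names ReplaceFilter and Echo and the run method are hypothetical, the downstream handler is deliberately crude (it does not escape special characters or print attributes), and a match that straddles two chunks of character data would be missed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, for illustration only.
final class ReplaceFilter extends XMLFilterImpl {
    private final String target;
    private final String replacement;

    ReplaceFilter(XMLReader parent, String target, String replacement) {
        super(parent);
        this.target = target;
        this.replacement = replacement;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Rewrite the text, then pass it on to the next handler in the chain.
        String text = new String(ch, start, length)
            .replace(target, replacement);
        super.characters(text.toCharArray(), 0, text.length());
    }

    /** Crude downstream handler that echoes elements and text. */
    static final class Echo extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Parses the XML text through the filter and returns the echoed result. */
    static String run(String xml, String target, String replacement) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ReplaceFilter filter =
                new ReplaceFilter(reader, target, replacement);
            Echo echo = new Echo();
            filter.setContentHandler(echo);
            filter.parse(new InputSource(new StringReader(xml)));
            return echo.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<credit>red</credit>", "red", "yellow"));
    }
}
```

Every event the filter does not override is forwarded unchanged by XMLFilterImpl, which is why the element name credit survives while the text changes.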

Just as in the Java example, all of the real work is done in the content handler. This time we’re interested in attributes (which are associated with start element events) and text nodes, so we need to implement two methods in order to implement the core of our application — start_element and characters.

First, however, we need to set up our content handler object so that it can store some information when it is constructed:

sub new {
    my $type = shift;
    my $target = shift;
    my $replace = shift;
    my $in_attr = shift;
    my $chain = shift;

    my $self = { 'target' => $target,
                 'replace' => $replace,
                 'in_attr' => $in_attr,
                 'chain' => $chain };

    return bless $self, $type;
}

When new is called, it will be passed four parameters: target and replace (the search string and its replacement), in_attr (a flag that indicates whether or not replacement should be done inside attribute values), and chain. The chain parameter should be another instance of a content handler. All events will be passed to the chained handler.

We store these arguments in a hash and bless it, constructing a Perl object.

Using the information stored in the hash, our characters method can now perform the replacement and pass along the result:

sub characters {
    my ($self, $data) = @_;

    # \Q...\E makes the target match literally, not as a regular expression
    $data->{'Data'} =~ s/\Q$self->{'target'}\E/$self->{'replace'}/sg;

    $self->{'chain'}->characters($data) if $self->{'chain'};
}

The start_element handler has to do a little bit more work. If replacement is to be performed in attribute values, each of the attributes must be processed:

sub start_element {
    my ($self, $element) = @_;

    if ($self->{'in_attr'}) {
        foreach my $name (keys %{$element->{'Attributes'}}) {
            my $value = $element->{'Attributes'}->{$name};
            # \Q...\E makes the target match literally
            $value =~ s/\Q$self->{'target'}\E/$self->{'replace'}/sg;
            $element->{'Attributes'}->{$name} = $value;
        }
    }

    $self->{'chain'}->start_element($element) if $self->{'chain'};
}

The rest of our methods (comment, for example) simply call the chained handler:

sub comment {
    my $self = shift;
    $self->{'chain'}->comment(@_) if $self->{'chain'};
}

Now that we have a content handler, it’s time to look at the initialization. First, we get our arguments:

use Getopt::Std;
use vars qw($opt_a);    # set by getopts when -a is given

my $usage = "Usage: $0 [-a] xmlfile target replace\n";

die $usage if not getopts('a');

my $xmlfile = shift @ARGV || die $usage;
my $target = shift @ARGV || die $usage;
my $replace = shift @ARGV || die $usage;

For this article, we also have another crude content handler that prints all the events it receives (a process often called serialization). First, we create an instance of that handler, then we create an instance of our content handler, passing the serializer to it:

my $shandler = new SerializeHandler();
my $handler = new MyReplaceHandler($target, $replace, $opt_a, $shandler);

The real work is done by initializing and calling the parser:

my $parser = new XML::Parser::PerlSAX (Handler => $handler);

$parser->parse (Source => { 'SystemId' => $xmlfile });

The result, printed on stdout, will be the modified XML.

Using DOM in Java

Putting our DOM hats back on, the first thing we have to do is get a parser to build the DOM for us. Using JAXP again, this is fairly straightforward.

First get a DocumentBuilderFactory, then get a DocumentBuilder, and then use the builder to make a DOM; it will return the Document node. From the Document, we can get the root Element:

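factory = DocumentBuilderFactory.newInstance();

try {
    builder = factory.newDocumentBuilder();
    Document doc = null;

    try {
        doc = builder.parse(xmlfile);
    } catch (SAXException se) {
        System.out.println("SAX exception.");
        System.exit(1);
    }

    Element root = doc.getDocumentElement();
    // (the outer try block continues with the walk and the output step)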

Next, we walk recursively over the DOM by calling walk on the root element, and then print the result.


The walk method starts with a node and walks over the subtree at that node:

private static void walk(Node node) {

// Element nodes get special consideration:
    if (node.getNodeType() == Node.ELEMENT_NODE) {

If we are performing replacement in attributes, we have to walk over all of them. The DOM getAttributes method returns a list in a NamedNodeMap. We walk over that list, extracting each attribute and modifying its value, as shown in Listing Three.

Listing Three: Modifying Attribute Values

// Check the attributes
if (in_attr) {
    Element element = (Element) node;
    NamedNodeMap attributes = element.getAttributes();
    for (int count = 0; count < attributes.getLength(); count++) {
        Attr attr = (Attr) attributes.item(count);
        String value = attr.getValue();
        value = stringReplace(value, target, replace);
        // Store the modified value back in the attribute
        attr.setValue(value);
    }
}

Elements may also have children. If this one does, we walk over each child:

// Walk over the children, if there are any
Node child = node.getFirstChild();
while (child != null) {
    walk(child);
    child = child.getNextSibling();
}

We’re also interested in text nodes on this walk, so we perform the necessary replacement in each text node:

    } else if (node.getNodeType() == Node.TEXT_NODE) {
        // Get the text out of the text node
        Text textNode = (Text) node;
        String text = textNode.getData();
        text = stringReplace(text, target, replace);
        // Store the modified text back in the node
        textNode.setData(text);
    }
}
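The stringReplace helper called above is not shown in these fragments. A minimal sketch of what such a helper might look like follows. The enclosing class name Strings is hypothetical, the signature is inferred from the calls above, and the body is an assumption rather than the article's own implementation. It replaces every literal occurrence of the target, scanning left to right.

```java
// Hypothetical implementation of the helper used in the fragments above.
final class Strings {
    /** Replaces every literal occurrence of target in text with replace. */
    static String stringReplace(String text, String target, String replace) {
        if (target.isEmpty()) {
            // Nothing sensible to search for; leave the text alone.
            return text;
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(target, from)) >= 0) {
            // Copy the text before the match, then the replacement.
            out.append(text, from, hit).append(replace);
            from = hit + target.length();
        }
        // Copy whatever follows the last match.
        return out.append(text.substring(from)).toString();
    }

    public static void main(String[] args) {
        System.out.println(stringReplace("red credit", "red", "yellow"));
        // prints "yellow cyellowit"
    }
}
```

On current Java versions, String.replace(target, replace) does the same job for non-empty targets. The point of the main method is that the helper knows nothing about markup, which is exactly why it must only ever be handed attribute values and text nodes.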

When the walk is finished, the entire tree will have been updated. Writing the result is very similar: the serializer walks over the tree and prints a serialized representation of each node.

Both versions of this program use very simple serialization algorithms; they don’t take into consideration character encoding and a number of other issues. Many tools come with better interfaces for performing serialization, and real applications should use them if they’re available. Internationalization is another area that could use improvement in these applications.
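One such interface on the Java side is the identity transformer in javax.xml.transform, which is part of JAXP. The sketch below uses it to write a DOM back out as text; it is offered as one possible approach, not as what the example programs do. The class name Serialize and the toXml and parse helpers are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class Serialize {
    /** Serializes a DOM document to a string using an identity transform. */
    static String toXml(Document doc) {
        try {
            // A transformer created without a stylesheet copies its input.
            Transformer identity =
                TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize", e);
        }
    }

    /** Parses XML text into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(parse("<a b=\"1\">x &amp; y</a>")));
    }
}
```

The transformer takes care of escaping special characters such as the ampersand, and OutputKeys.ENCODING can be set when a particular character encoding is required.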


In this article we’ve explored SAX and DOM programming in both Perl and Java. Hopefully this has whetted your appetite to do more XML programming, no matter what your favorite programming language is. Using the techniques presented here, you should be well on your way toward doing so.

The Many Roles of XML

XML can be a confusing technology, but not because it is inherently complex or difficult to understand. At the most basic level, XML is rather simple. As the “ML” in “XML” indicates, it is a mark-up language, one that often resembles the well-known HTML. The “X” in “XML” (extensibility) is what often causes confusion. XML has been applied to such a diverse set of problems in so many different fields that it can be mind-boggling to most newcomers.

Let’s look at some of the most common uses for XML today.

Storing Structured Information

Applications, especially those written in Java, are beginning to use XML as their native storage format for configuration files, documents, persistent objects, and many other types of data. Java (followed closely by Perl and Python) has access to a number of XML parsers, very mature libraries, and APIs, all of which make it easy to write software that reads, writes, and works with XML.

Unlike some “standards,” there is only one XML. There can be no “Microsoft XML” or “Sun XML” or “Red Hat XML.” A single person, company, or group cannot co-opt XML for their own use. This means that developers can freely use XML in their applications and not worry about incompatibilities across platforms and software versions. As a result, XML is quickly becoming the preferred format for storing data in new applications.

Unlike a proprietary file format, XML does not force you to write, debug, and optimize your own parser and validation routines over and over again. Any conforming XML parser can read any well-formed XML document.

The XML code libraries and APIs are built in such a way that programmers don’t even need to understand XML. Instead, they work (as programmers always have) with data structures. These data structures are often built up from data that is stored in files, retrieved from the network, or constructed on the fly.

The XML parser eliminates the need to write any parsing code in order to read the data. And when it comes time to produce XML, it’s often just a matter of calling a function or method that converts the data structure to XML.

When it comes to storing and retrieving data that’s not already in some type of structured database, XML really makes a programmer’s life easier.

Creating New Mark-Up Languages

XML is a language that can be used to create new mark-up languages: XML dialects that are custom built for a specific task or domain of knowledge. Individuals and organizations around the world have used XML to create new mark-up languages that are being used to replace older, proprietary (often binary) document formats.

Examples include:


  • Several international news organizations (led by Reuters) have developed an open and standard mark-up language called NewsML. NewsML is now being adopted as the standard way to exchange and syndicate news.
  • Mathematicians have developed MathML as an open and standard mark-up language for representing equations and other mathematical constructs.
  • The emerging XML-based language for representing Scalable Vector Graphics, SVG, could one day replace the proprietary file formats used by many commercial software packages. If that happens, no longer will you be confined to using a single piece of software to manipulate drawings and figures. SVG will make it possible to import and export figures without loss of quality or content.
  • Looking to make the process of reporting and exchanging financial and business data easier, a team of companies is working to formalize the eXtensible Business Reporting Language (XBRL).
  • RDF Site Summary (RSS) is a wildly popular, lightweight document format that lists resources (often as headlines) available from a given site. RSS is used on thousands of free and commercial Web sites (Slashdot, Freshmeat, Netscape, etc.) to easily syndicate headlines.

The list of XML-derived languages is already quite large. One of the key points about all these languages is, like XML, they are all open standards. And because XML is a rigidly defined language, there is no room for ambiguity. An XML-based language is exactly the same in every piece of software and on every platform that uses it.

Information Exchange

The final “killer application” of XML, information exchange, is really just the logical extension of XML being great for creating new mark-up languages and storing, reading, and writing structured data.

Rather than invent an entirely new network protocol every time two programs wish to communicate, it is easier to use an existing (and popular) protocol, like HTTP, and XML to encapsulate the information that needs to be exchanged. This sort of network-based information exchange using XML has already proven to be very powerful. Here are a few examples:


Jabber

The Jabber project has built an infrastructure for routing XML messages between any two computers on a network. It was originally developed as an open instant messaging (IM) system, but the Jabber team chose to use XML to encapsulate the messages they pass, and that has enabled them to generalize their messaging platform into one that can pass arbitrary data from person A to person B, or from process A to process B.

Web Services

XML-RPC and SOAP (the Simple Object Access Protocol), the two cornerstones of Web Services, are both XML-based protocols that aim to simplify RPC (Remote Procedure Calls) as well as sending objects across the network. Unlike the previous generation of competing and incompatible standards, namely Microsoft’s DCOM (Distributed Component Object Model) and the Object Management Group’s CORBA (Common Object Request Broker Architecture), SOAP and XML-RPC have quickly become the next-generation models upon which applications are being built. Both SOAP and XML-RPC are platform neutral and are more flexible and easier to program than their predecessors.

XML is not just another tool. It is an enabling technology that is continuing to find its way into new areas of science, information technology, and the Internet. In the business world, XML is already the standard way of exchanging even ad-hoc data between companies. Its simplicity, extensibility, and open nature are working to ensure that we’ve only begun to see the doors that it is likely to open. — Jeremy Zawodny


Resources

JAXP: http://java.sun.com/xml
The Java API for XML Processing (JAXP) is a popular standard interface.

Perl XML Modules: http://search.cpan.org/search?mode=module&query=XML
XML::DOM is a DOM interface for Perl; XML::Parser::PerlSAX, a SAX interface, is included in libxml-perl.

PyXML: http://sourceforge.net/projects/pyxml
The Python library from the Python SIG for XML Processing.

TclXML: http://www.zveno.com/zm.cgi/in-tclxml
TclXML is the most widely used set of XML tools for Tcl.

expat: http://www.jclark.com/xml/expat.html
Expat is the real workhorse behind a number of XML tools.

Norm Walsh is a Staff Engineer in Sun’s XML Technology Center. He can be reached at norman.walsh@sun.com.
