XML is becoming the standard method for storing and exchanging structured data. Here's the essential information you need.
On an average day, you probably work with files in a dozen different formats: plain text, log files, HTML, comma-separated files, colon-delimited files, various sorts of configuration files that have keys and values delimited in some quasi-standard way, completely binary formats, and the proprietary formats of your favorite set of “office” tools.
Unix comes with a whole suite of tools to make working with these various formats practical: head, tail, cat, cut, sed, etc. And if those won’t do the job, you can always pull out your favorite Swiss army knife: perl, python, or tcl.
Have you ever wished that there was a better way — a common, flexible format you could use to store all this information? Then you could write a single parser and a suite of applications to read, process, transform, and query all these various files in a uniform way. If you have, chances are that you want to look at XML.
In the larger world, XML has already been embraced by many of the world’s largest software companies (Oracle, Microsoft, IBM, Sun) as well as some of today’s leading open source projects. See the XML in Open Source sidebar for more on this topic.
So if you’re new to XML, there is no better time than now to start learning.
What is XML?
XML is a standard way of describing structured information. The term “structured information” covers a broad spectrum of applications: business transactions, inter-process communication, electronic messages, database tables, remote procedure calls, Web pages, user preferences and application configuration data, drawings, mathematical equations, and even documents like this article can be described as structured information. As a matter of fact, this document was written entirely in XML.
XML uses elements and attributes (in much the same way as HTML) to identify individual items of information. These items can be mixed together in various ways to form larger items. A simple XML document is shown in Listing One.
Listing One: A Simple XML Document
<p>Some <em>more</em> text.</p>
<p>See also <xref doc="otherdoc.xml"/>.</p>
To explore how XML works, we’ll look at how you could convert an example file format (an address book) into XML and the advantages that you’d get from doing so.
Suppose you need to keep track of a bunch of names and addresses. One way that you could do this is with a comma-separated text file, like the one in Listing Two.
Listing Two: A Comma-Delimited Address
Jane Smith,15 Pine St,Suite 304,Springfield,OR,92744,555-1234,,555-9999
John Val\'ery,123 Any Street,,Anytown,MA,01034,,,
Barney & Betty Rubble,"3 Rock Terrace, Apt 2b",,Bedrock,MT,88432-2433,,,
Tucker Case,533b Ridge Rd,Pine Grove,CA,92315,,555-5309,
Using simple Unix utilities, you could write programs to handle the data. So if you already have tools to process the address book, why convert it to XML?
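One caveat about the simple-utilities approach is worth seeing concretely. The sketch below uses Python rather than shell tools. It splits the third record of Listing Two on every comma, the way a plain field cutter would, and then parses the same record with Python's standard csv module. The variable names are illustrative and are not part of any existing address book tool.

```python
import csv

# Third record from Listing Two; the street field contains a quoted comma.
record = 'Barney & Betty Rubble,"3 Rock Terrace, Apt 2b",,Bedrock,MT,88432-2433,,,'

# Naive approach: split on every comma, ignoring the quotes.
naive_fields = record.split(",")

# Quote-aware approach: the csv module treats the quoted comma as data.
csv_fields = next(csv.reader([record]))

print(len(naive_fields), naive_fields[1])
print(len(csv_fields), csv_fields[1])
```

The naive split breaks the street address in two and shifts every later field by one position, so even a format this small needs a parser that knows its quoting rules.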
XML offers some real advantages, both for this particular application and for larger systems:
1. XML Is the Extensible Markup Language. It has no predefined set of tags, so it can be used to represent any form of data: Web pages, technical documentation, database tables, etc. Unlike HTML, which has a fixed set of tags, XML allows us to use tag names that make sense for this application: <address>, <name>, etc.
2. XML Formats Are Self-Describing. Because each piece of information in XML is identified with a name, it’s easy to understand. The data in Listing Two isn’t well described. You can infer the meaning of some of the fields, but you’d have to build software that knows what a phone number looks like, for example.
3. XML Is an Open Standard. There are no proprietary aspects that open source developers will have to work around or reverse-engineer. In addition, the format is text-based, so you can use standard Unix tools to search and manipulate the data. This also means you can get off-the-shelf parsers that will work in interoperable ways.
4. XML Is Based on Unicode. This means that it is fully internationalized out of the box. There are relatively few constraints on the characters that may be used, freeing document authors and processing systems to work in the language that is most convenient for their locale.
Listing Three shows what the first address record from Listing Two looks like in XML.
Listing Three: The First Address in XML
<street>15 Pine St</street>
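To illustrate the self-describing and off-the-shelf-parser points, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The nesting of name and street inside address is an assumption made for the example; only those three element names appear in this article, and the sample is not a reconstruction of the full listing.

```python
import xml.etree.ElementTree as ET

# Assumed structure: <name> and <street> nested directly inside <address>.
record = """<address>
  <name>Jane Smith</name>
  <street>15 Pine St</street>
</address>"""

address = ET.fromstring(record)

# Each field is retrieved by name rather than by column position.
fields = {child.tag: child.text for child in address}
print(fields)
print(address.findtext("street"))
```

Because every value carries its own label, the program does not need to know which column holds the street, and adding a new element later does not disturb the lookups that already exist.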
XML in Open Source
Because XML is an open standard, it is quickly becoming the format of choice for open source projects that need to store and/or exchange structured information. In fact, XML is on the road to becoming a universal enabler that allows applications and services to communicate with each other.
Here are just a few of the popular open source projects using XML today:
Jabber The popular instant messaging system uses XML as its native messaging format. In fact, Jabber is quickly evolving from an instant messaging platform into a more generic distributed XML routing system. (More on this in an upcoming article…)
Zope As a heavily object-based system, Zope uses XML to represent objects and data whenever they need to be imported or exported from its object database.
Mozilla While “skins” for applications and desktops have become quite popular, Mozilla implements an XML-based language called XUL, pronounced “zool” (from the Ghostbusters movie), which allows you to completely configure its user interface.
Basic Features of XML
The basic building blocks of XML documents are elements and attributes. So let’s take a look at these.
Element tags are delimited by angle brackets. Elements that have content appear in the form <element-name> followed by some optional content followed by </element-name>. The first tag is the start tag; the second is the end tag. In XML, every non-empty element must have explicit start and end tags, and they must be properly nested. Unlike in HTML, you may never omit end tags.
Empty elements have the form <element-name/>. The trailing slash before the closing angle bracket indicates that the element has no content and consequently no end tag is allowed. In Listing One, you know that the xref element is empty because it has a closing />.
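These rules can be checked with any conforming parser. The following sketch uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>Some <em>more</em> text.</p>"))  # start and end tags match
print(is_well_formed("<p>Some <em>more text.</p>"))       # </em> omitted
print(is_well_formed("<p>Some <em>more</p></em>"))        # improperly nested

# The empty xref element from Listing One has an attribute but no content.
para = ET.fromstring('<p>See also <xref doc="otherdoc.xml"/>.</p>')
xref = para.find("xref")
print(xref.get("doc"), xref.text, len(xref))
```

The parser rejects the second and third strings outright; it does not try to guess where the missing or misplaced end tag belongs, as an HTML browser would.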
Element names must begin with a letter or an underscore and may then contain any number of letters, digits, hyphens, underscores, and periods. The full repertoire of name characters is given in the XML recommendation. It’s considerably larger than the US ASCII set of alphabetic characters.
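Rather than hand-coding the naming rules, one rough way to test a candidate name is to ask a parser to read an empty element with that name. The sketch below does this with Python's standard xml.etree.ElementTree module. The function is_valid_element_name is a hypothetical helper, and names containing a colon are left out because the parser treats them as namespace prefixes.

```python
import xml.etree.ElementTree as ET

def is_valid_element_name(name):
    """Check a candidate name by asking the parser to read <name/>."""
    try:
        root = ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    # Guard against input that smuggles in extra markup such as attributes.
    return root.tag == name

candidates = ["street", "zip-code", "_private", "v1.2", "1st-line", "-dash", "caf\u00e9"]
for candidate in candidates:
    print(candidate, is_valid_element_name(candidate))
```

Names that start with a digit or a hyphen are rejected, while a name containing a non-ASCII letter such as é is accepted.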
Element start tags may additionally have attributes. (End tags may not.) In Listing Three, Jane Smith’s phone number is identified with an attribute. Attributes are name-value pairs. Like element names, attribute names must conform to the XML notion of a name. All attribute values must be quoted with either double or single quotes. Listing Four shows some examples.
Listing Four: Attributes in XML
<foo nested-quotes="'single in double'">
<bar nested-quotes='"double in single"'>
<baz id="test" multiple-attributes="ok">
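As a quick check of the quoting rules, the sketch below feeds the three start tags from Listing Four to Python's standard xml.etree.ElementTree parser. Each tag is closed with /> here so that it forms a complete document on its own; that change is made for the example and is not part of the listing.

```python
import xml.etree.ElementTree as ET

# Start tags from Listing Four, written as empty elements so each one parses alone.
foo = ET.fromstring("""<foo nested-quotes="'single in double'"/>""")
bar = ET.fromstring("""<bar nested-quotes='"double in single"'/>""")
baz = ET.fromstring('<baz id="test" multiple-attributes="ok"/>')

# The outer quotes delimit the value; the inner quotes are part of it.
print(foo.get("nested-quotes"))
print(bar.get("nested-quotes"))
print(baz.attrib)

# An unquoted attribute value is not well-formed XML.
try:
    ET.fromstring("<baz id=test/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)
```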
Converting the next address book record, shown in Listing Five, introduces some new ideas.
Listing Five: The Second Address in XML
<street>123 Any Street</street>
First, note what we’ve done with the \'e in Val\'ery: we’ve turned it into &#233;. That’s because a lowercase “e” with an acute accent is located at position 233 in Unicode.
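The code point is easy to confirm. This small Python sketch looks up character 233 and then lets the standard xml.etree.ElementTree parser resolve the reference &#233; inside a name element; the one-line sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

E_ACUTE = 233  # decimal position of the lowercase e with acute accent

# chr() maps a code point to its character, ord() goes the other way.
print(chr(E_ACUTE), ord("\u00e9"))

# The parser replaces the reference with the character it stands for.
name = ET.fromstring("<name>John Val&#233;ry</name>")
print(name.text)
```

The source file stays pure ASCII, yet the application receives the accented character once the document has been parsed.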
XML reserves a small number of characters, like <, to identify document markup. In order to use these characters literally in your document, they must be escaped somehow.
There are several escaping mechanisms. One of the most general is the numeric character reference. Any Unicode character can be referenced in your document using a numeric character reference, regardless of the character encoding of the document itself. These have the form