On an average day, you probably work with files in a dozen different formats: plain text, log files, HTML, comma-separated files, colon-delimited files, various sorts of configuration files that have keys and values delimited in some quasi-standard way, completely binary formats, and the proprietary formats of your favorite set of “office” tools.
Unix comes with a whole suite of tools to make working with these various formats practical: head, tail, cat, cut, sed, etc. And if those won’t do the job, you can always pull out your favorite Swiss army knife: perl, python, or tcl.
Have you ever wished that there was a better way — a common, flexible format you could use to store all this information? Then you could write a single parser and a suite of applications to read, process, transform, and query all these various files in a uniform way. If you have, chances are that you want to look at XML.
In the larger world, XML has already been embraced by many of the world’s largest software companies (Oracle, Microsoft, IBM, Sun) as well as some of today’s leading open source projects. See the XML in Open Source sidebar for more on this topic.
So if you’re new to XML, there is no better time than now to start learning.
What is XML?
XML is a standard way of describing structured information. The term “structured information” covers a broad spectrum of applications: business transactions, inter-process communication, electronic messages, database tables, remote procedure calls, Web pages, user preferences and application configuration data, drawings, mathematical equations, and even documents like this article can be described as structured information. As a matter of fact, this document was written entirely in XML.
XML uses elements and attributes (in much the same way as HTML) to identify individual items of information. These items can be mixed together in various ways to form larger items. A simple XML document is shown in Listing One.
Listing One: A Simple XML Document
<?xml version='1.0'?>
<doc>
<p>Some text.</p>
<p>Some <em>more</em> text.</p>
<p>See also <xref doc="otherdoc.xml"/>.</p>
</doc>
To explore how XML works, we’ll look at how you could convert an example file format (an address book) into XML and the advantages that you’d get from doing so.
Suppose you need to keep track of a bunch of names and addresses. One way that you could do this is with a comma-separated text file, like the one in Listing Two.
Listing Two: A Comma-Delimited Address
#name,line1,line2,city,state
Jane Smith,15 Pine St,Suite 304,Springfield,OR,92744,555-1234,,555-9999
John Val\'ery,123 Any Street,,Anytown,MA,01034,,,
Barney & Betty Rubble,"3 Rock Terrace, Apt 2b",,Bedrock,MT,88432-2433,,,
Tucker Case,533b Ridge Rd,,Pine Grove,CA,92315,,555-5309,
Using simple Unix utilities, you could write programs to handle the data. So if you already have tools to process the address book, why convert it to XML?
XML offers some real advantages, both for this particular application and for larger systems:
1. XML Is the Extensible Markup Language. It has no predefined set of tags, so it can be used to represent any form of data: Web pages, technical documentation, database tables, etc. Unlike HTML, which has a fixed set of tags, XML lets us choose tag names that make sense for this application: <address>, <name>, etc.
2. XML Formats Are Self-Describing. Because each piece of information in XML is identified with a name, it’s easy to understand. The data in Listing Two isn’t well described. You can infer the meaning of some of the fields, but you’d have to build software that knows what a phone number looks like, for example.
3. XML Is an Open Standard. There are no proprietary aspects that open source developers will have to work around or reverse-engineer. In addition, the format is text-based, so you can use standard Unix tools to search and manipulate the data. This also means you can get off-the-shelf parsers that will work in interoperable ways.
4. XML Is Based on Unicode. This means that it is fully internationalized out of the box. There are relatively few constraints on the characters that may be used, freeing document authors and processing systems to work in the language that is most convenient for their locale.
Listing Three shows what the first address record from Listing Two looks like in XML.
Listing Three: The First Address in XML
<address>
<name>Jane Smith</name>
<street>15 Pine St</street>
<street>Suite 304</street>
<city>Springfield</city>
<state>OR</state>
<zip>92744</zip>
<phone type="work">555-1234</phone>
<fax>555-9999</fax>
</address>
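The conversion from Listing Two to Listing Three can be sketched in a few lines of Python. This is only an illustration: the column meanings in FIELDS (including the separate work and home phone columns) are assumptions inferred from Listing Three, and record_to_xml is a hypothetical helper, not part of any library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout, inferred from Listing Three. An entry of the form
# "phone:work" means a <phone> element with type="work".
FIELDS = ["name", "street", "street", "city", "state", "zip",
          "phone:work", "phone:home", "fax"]


def record_to_xml(row):
    """Build an <address> element from one comma-delimited record."""
    address = ET.Element("address")
    for field, value in zip(FIELDS, row):
        if not value:
            continue  # empty fields are simply left out of the XML
        tag, _, phone_type = field.partition(":")
        child = ET.SubElement(address, tag)
        child.text = value
        if phone_type:
            child.set("type", phone_type)
    return address


line = "Jane Smith,15 Pine St,Suite 304,Springfield,OR,92744,555-1234,,555-9999"
row = next(csv.reader(io.StringIO(line)))  # csv also copes with quoted commas
address = record_to_xml(row)
xml_text = ET.tostring(address, encoding="unicode")
print(xml_text)
```

The csv module is used instead of a plain split on commas so that a quoted field such as the Rubbles’ street address would survive intact.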
Because XML is an open standard, it is quickly becoming the format of choice for open source projects that need to store and/or exchange structured information. In fact, XML is on the road to becoming a universal enabler that allows applications and services to communicate with each other.
Here are just a few of the popular open source projects using XML today:
Jabber: The popular instant messaging system uses XML as its native messaging format. In fact, Jabber is quickly evolving from an instant messaging platform into a more generic distributed XML routing system. (More on this in an upcoming article…)
Zope: As a heavily object-based system, Zope uses XML to represent objects and data whenever they need to be imported or exported from its object database.
Mozilla: While “skins” for applications and desktops have become quite popular, Mozilla implements an XML-based language called XUL, pronounced “zool” (from the Ghostbusters movie), which allows you to completely configure its user interface.
Basic Features of XML
The basic building blocks of XML documents are elements and attributes. So let’s take a look at these.
Elements
Elements are delimited by angle brackets. Elements that have content appear in the form <element-name> followed by some optional content followed by </element-name>. The first form is the start tag, the second is the end tag. In XML, every non-empty element must have explicit start and end tags, and they must be properly nested. Unlike in HTML, you may never omit end tags.
Empty elements have the form <element-name/>. The trailing slash before the closing angle bracket indicates that the element has no content and consequently no end tag is allowed. In Listing One, you know that the xref element is empty because it has a closing />.
Element names must begin with a letter or an underscore and may then contain any number of letters, digits, hyphens, underscores, and periods. The full repertoire of name characters is given in the XML Recommendation. It’s considerably larger than the US ASCII set of alphabetic characters.
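To see these element rules in action, here is a minimal sketch using Python’s standard xml.etree.ElementTree parser on a copy of Listing One. The helper name parses is made up for this example; the point is only that a conforming parser rejects omitted or badly nested end tags outright.

```python
import xml.etree.ElementTree as ET

listing_one = """<?xml version='1.0'?>
<doc>
<p>Some text.</p>
<p>Some <em>more</em> text.</p>
<p>See also <xref doc="otherdoc.xml"/>.</p>
</doc>"""

doc = ET.fromstring(listing_one)

# The empty element has neither text nor child elements.
xref = doc.find("p/xref")
xref_is_empty = xref.text is None and len(xref) == 0


def parses(text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


omitted_end_tag = parses("<doc><p>No end tag for p</doc>")
bad_nesting = parses("<doc><p><em>crossed tags</p></em></doc>")
print(xref_is_empty, omitted_end_tag, bad_nesting)
```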
Attributes
Element start tags may additionally have attributes. (End tags may not.) In Listing Three, Jane Smith’s phone number is identified with an attribute. Attributes are name-value pairs. Like element names, attribute names must conform to the XML notion of a name. All attribute values must be quoted with either double or single quotes. Listing Four shows some examples.
Listing Four: Attributes in XML
<book lang="en">
<imagedata fileref="http://example.com/somefile"/>
<foo nested-quotes="'single in double'">
<bar nested-quotes='"double in single"'>
<baz id="test" multiple-attributes="ok">
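As a rough illustration of how a parser hands attributes to an application, the sketch below feeds the tags from Listing Four to Python’s xml.etree.ElementTree. Each start tag has been turned into an empty element so that it can be parsed on its own; that change is an assumption made for the example, not part of the listing.

```python
import xml.etree.ElementTree as ET

samples = [
    '<book lang="en"/>',
    '<imagedata fileref="http://example.com/somefile"/>',
    """<foo nested-quotes="'single in double'"/>""",
    """<bar nested-quotes='"double in single"'/>""",
    '<baz id="test" multiple-attributes="ok"/>',
]

# Map each element name to its attributes as plain name-value pairs.
parsed = {}
for text in samples:
    element = ET.fromstring(text)
    parsed[element.tag] = dict(element.attrib)
    print(element.tag, parsed[element.tag])

# An unquoted attribute value is not well formed and is rejected.
try:
    ET.fromstring("<book lang=en/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
```

Note that the quotes themselves are not part of the value; only the inner, nested quotes survive.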
Character Entities
Converting the next address book record, shown in Listing Five, introduces some new ideas.
Listing Five: The Second Address in XML
<address>
<name>John Val&#233;ry</name>
<street>123 Any Street</street>
<city>Anytown</city>
<state>MA</state>
<zip>01034</zip>
</address>
First, note what we’ve done with the \'e from Listing Two: we’ve turned it into &#233;. That’s because a lowercase “e” with an acute accent is located at position 233 in Unicode.
XML reserves a small number of characters, like <, to identify document markup. In order to use these characters literally in your document, they must be escaped somehow.
There are several escaping mechanisms. One of the most general is the numeric character reference. Any Unicode character can be referenced in your document using a numeric character reference, regardless of the character encoding of the document itself. These have the form &#nnnn;, where nnnn is the Unicode character number in decimal. If you prefer to use hexadecimal, then you can use the format &#xhhhh;.
Using numeric character references, you can put a literal < character in your document like this:
&#60;
The common markup characters (<, >, &, ', and ") need to be escaped so frequently in XML documents that XML defines mnemonic references, called entities, for them, as summarized in Table One.
Table One: Predefined Character Entities
Character   Entity   Numeric
<           &lt;     &#60;
>           &gt;     &#62;
&           &amp;    &#38;
'           &apos;   &#39;
"           &quot;   &#34;
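The references in Table One can be checked with a short Python sketch. It assumes the standard xml.etree.ElementTree parser and the escape helper from xml.sax.saxutils; the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal and hexadecimal references name the same Unicode character.
decimal_name = ET.fromstring("<name>John Val&#233;ry</name>").text
hex_name = ET.fromstring("<name>John Val&#xE9;ry</name>").text

# The five predefined entities, plus a numeric reference to "<".
markup = ET.fromstring("<t>&lt; &gt; &amp; &apos; &quot; &#60;</t>").text

# Going the other way: escape markup characters before writing text out.
escaped = escape("Barney & Betty Rubble <Bedrock>")

print(decimal_name, hex_name, markup, escaped)
```

By default, escape handles only &, <, and >; quotes matter only inside attribute values.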
For the purpose of our sample address book, US ASCII character encoding is sufficient. If you were using an encoding that contained the accented character directly (ISO Latin 1, for example) then you could just put that character in the document.
The other thing to notice about Listing Five is that the <street> element occurs only once. In the comma-delimited format, it was important to put the blank second field in the file. In XML this is not necessary. There’s no need to count commas any more and there’s no reason to insert empty fields.
There’s little new to learn in the third or fourth address, so it is time to pull all of the addresses together into an XML document. To do that, however, there are a few more concepts to investigate.
Comments and Processing Instructions
Suppose you want to insert notes to yourself in an XML file and don’t want them processed. Or, maybe you need to provide hints directly to your XML processor. You can do so with comments and processing instructions.
Comments
XML comments, like comments in most programming languages, are meant for annotations that aren’t expected to have an influence on subsequent processing. Many XML processors discard comments when parsing. A comment looks like this:
<!-- This is a comment -->
Unfortunately, XML comments cannot be nested. The sequence -- is forbidden inside a comment.
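Both points about comments can be observed with Python’s xml.etree.ElementTree, which is one example of a processor that drops comments by default. The sample documents below are invented for illustration.

```python
import xml.etree.ElementTree as ET

text = "<address><!-- verify this number --><phone>555-1234</phone></address>"
address = ET.fromstring(text)

# The comment is discarded; only the element survives in the parsed tree.
children = [child.tag for child in address]

# A double hyphen inside a comment makes the document not well formed.
try:
    ET.fromstring("<address><!-- one -- two --></address>")
    double_hyphen_accepted = True
except ET.ParseError:
    double_hyphen_accepted = False

print(children, double_hyphen_accepted)
```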
Processing Instructions
Sometimes you want to pass additional information to a specific processing application. For example, you may want to tell the application that is processing your data what filename to use for some piece of output or where it’s acceptable to insert a line break in a long title.
Processing instructions are the way to pass this information to a processing application. Processing instructions have the form:
<?target any-data-you-want?>
All processing instructions must begin with a target. The target is simply a name. Everything after the name, up to the closing ?>, is part of the processing instruction. Although there is no requirement to do so, it has become traditional to use pseudo-attribute syntax in processing instructions, like this:
<?addressbook preferred-phone="work"?>
Target names beginning with xml (in any combination of upper and lower case) are reserved. Note that target names cannot be namespace qualified, so if you’re making one up, try to make it globally unique in some other way.
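Here is one way an application might read such a processing instruction, sketched with Python’s xml.dom.minidom. Because pseudo-attributes are only a convention, the parser returns the data as a raw string; the regular expression that splits it into name-value pairs is an assumption of this example, not a standard API.

```python
import re
from xml.dom import minidom

text = """<?xml version='1.0'?>
<?addressbook preferred-phone="work"?>
<addressbook/>"""

document = minidom.parseString(text)

# Collect the processing instructions that appear before the root element.
instructions = [node for node in document.childNodes
                if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]
pi = instructions[0]

# Split the raw data into pseudo-attributes (either quote style is accepted).
pairs = re.findall(r"([\w-]+)=(['\"])(.*?)\2", pi.data)
pseudo = {name: value for name, _quote, value in pairs}

print(pi.target, pi.data, pseudo)
```

Note that the XML declaration on the first line is not reported as a processing instruction, even though it looks like one.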
Building XML Documents
Now it’s time to pull the address records together into a whole document. There are only two things left to do — add a document element around the records and add an XML declaration to the top.
For historical reasons related to XML 1.0 validity checking, an XML document must have a single, outer-most element called the document or root element. In this case, it makes sense to use <addressbook>.
For now, just think of the XML declaration as identifying the file as XML (and specifically, the version of XML that we’re using — there’s only one, 1.0).
Listing Six shows our entire address book. Note the comment, the processing instruction, and the escaped & in the third address.
Listing Six: The Address Book in XML
<?xml version='1.0'?>
<!-- Converted from comma separated value form on 8 May 2001 -->
<?addressbook preferred-phone='work'?>
<addressbook>
<address>
<name>Jane Smith</name>
<street>15 Pine St</street>
<street>Suite 304</street>
<city>Springfield</city>
<state>OR</state>
<zip>92744</zip>
<phone type="work">555-1234</phone>
<fax>555-9999</fax>
</address>
<address>
<name>John Val&#233;ry</name>
<street>123 Any Street</street>
<city>Anytown</city>
<state>MA</state>
<zip>01034</zip>
</address>
<address>
<name>Barney &amp; Betty Rubble</name>
<street>3 Rock Terrace, Apt 2b</street>
<city>Bedrock</city>
<state>MT</state>
<zip>88432-2433</zip>
</address>
<address>
<name>Tucker Case</name>
<street>533b Ridge Rd</street>
<city>Pine Grove</city>
<state>CA</state>
<zip>92315</zip>
<phone type="home">555-5309</phone>
</address>
</addressbook>
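Once the address book is XML, any XML library can query it. The sketch below uses Python’s xml.etree.ElementTree on an abbreviated copy of Listing Six (three records, with the street, state, and zip elements left out to keep the example short).

```python
import xml.etree.ElementTree as ET

addressbook_xml = """<?xml version='1.0'?>
<!-- abbreviated copy of Listing Six -->
<addressbook>
<address>
<name>Jane Smith</name>
<city>Springfield</city>
<phone type="work">555-1234</phone>
<fax>555-9999</fax>
</address>
<address>
<name>Barney &amp; Betty Rubble</name>
<city>Bedrock</city>
</address>
<address>
<name>Tucker Case</name>
<city>Pine Grove</city>
<phone type="home">555-5309</phone>
</address>
</addressbook>"""

book = ET.fromstring(addressbook_xml)

# Every name, in document order. The escaped & comes back as a plain "&".
names = [address.findtext("name") for address in book.iter("address")]

# Work phone numbers, selected by attribute value.
work_phones = {address.findtext("name"): phone.text
               for address in book.findall("address")
               for phone in address.findall("phone[@type='work']")}

# Records with no phone at all: no comma counting required.
no_phone = [address.findtext("name") for address in book.findall("address")
            if address.find("phone") is None]

print(names, work_phones, no_phone)
```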
Well Formed Documents
In order for a document to truly be considered XML, it must be what is known as a “Well Formed Document.” Being well formed means that a document satisfies certain minimum requirements.
In brief, these include:
- Element start and end tags are properly nested.
- Attribute values do not contain literal < characters (the &lt; entity is okay).
- All numeric character references refer to Unicode characters that are legal in XML documents.
- Named entities are used correctly (declared, parsable, and not recursive).
This is not an exhaustive list, but it is sufficient for an introductory understanding. The bottom line, however, is that documents that aren’t well formed simply aren’t XML.
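The requirements listed above can be probed with any conforming parser. In this sketch, is_well_formed is a hypothetical helper built on Python’s xml.etree.ElementTree, and the test strings are made up to trigger one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "properly nested tags": "<a><b>text</b></a>",
    "badly nested tags": "<a><b>text</a></b>",
    "literal < in attribute": '<a title="1 < 2"/>',
    "escaped < in attribute": '<a title="1 &lt; 2"/>',
    "illegal character reference": "<a>&#0;</a>",
    "undeclared named entity": "<a>&nbsp;</a>",
}

results = {label: is_well_formed(text) for label, text in checks.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'rejected'}")
```

The &nbsp; case is a common surprise for people coming from HTML: apart from the five predefined entities, a named entity must be declared before it can be used.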
The Document Prolog
Most XML documents begin with a prolog. Occasionally, the prolog can be quite extensive, but it usually contains just two things: the XML Declaration and the Document Type Declaration.
The XML Declaration, which must be the first thing in an XML file if it is present at all, serves three purposes:
1. It identifies the XML version to which the document conforms. At this time, the version must be 1.0. The version number is required.
2. It may identify the character encoding used by the document. Although XML uses Unicode, there are many different ways to represent the characters of a document (ISO Latin 1, KOI8-R, Big5, etc.). If the declaration is not present, XML systems attempt to “sniff” the encoding (from MIME headers or the file system, for example). XML Processors are only required to support the UTF-8 and UTF-16 encodings, but in practice, most support many more.
3. It may identify itself as a standalone document. Standalone documents assert that there are no external declarations that affect the information passed to the processing application. In practice, it is not often used.
The format of an XML Declaration is:
<?xml version='1.0' encoding="US-ASCII" standalone='no'?>
where the version is required and the other parts are optional. (The above is the declaration used on the source document for this article.)
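The link between the declared encoding and character references can be seen by writing the same element out twice. This sketch assumes Python 3.8 or later, where xml.etree.ElementTree.tostring accepts an xml_declaration argument.

```python
import xml.etree.ElementTree as ET

name = ET.Element("name")
name.text = "John Val\u00e9ry"  # the accented character itself, not a reference

# In US-ASCII the accented character cannot be stored directly, so the
# serializer falls back to a numeric character reference.
ascii_bytes = ET.tostring(name, encoding="us-ascii", xml_declaration=True)

# In UTF-8 the character is written as-is.
utf8_bytes = ET.tostring(name, encoding="utf-8", xml_declaration=True)

print(ascii_bytes)
print(utf8_bytes)

# Both byte strings describe the same document content.
same_text = ET.fromstring(ascii_bytes).text == ET.fromstring(utf8_bytes).text
```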
The other common prolog element is the Document Type Declaration:
<!DOCTYPE element-name SYSTEM "system-identifier">
This declaration serves two purposes: it identifies the root element of the document (<element-name>) and it associates an external set of declarations, the DTD (Document Type Definition), with the document. The DTD is used to assist the parser in “validating” the document.
Validity
One of the important features that an XML parser can provide is the ability to validate a document. A DTD contains a set of declarations that identify additional rules and constraints that a document must satisfy in order to be considered a valid instance of the document type associated with that DTD.
For example, a DTD might state that every <address> record is required to have <name>, <street>, and <phone> elements. It may go on to specify the types of data that those elements may contain.
A document is valid if and only if it satisfies the constraints of the DTD with which it is associated. Examples include:
- For every element used in the document, there must be exactly one element declaration that identifies the legal content model for that element.
- Every attribute’s value is legal according to the declared type of the attribute.
- All attributes declared as “required” are actually specified in the document.
As with the well-formed constraints, this is not an exhaustive list. There are a number of additional validity constraints that are beyond the scope of this article.
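Python’s standard library does not include a validating parser, so the sketch below is a hand-written stand-in rather than real DTD validation. It checks only the example rule mentioned earlier (every address needs a name, a street, and a phone); REQUIRED and missing_children are names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: the child elements every <address> must contain.
REQUIRED = ("name", "street", "phone")


def missing_children(address):
    """List the required child elements that this address lacks."""
    return [tag for tag in REQUIRED if address.find(tag) is None]


book = ET.fromstring("""<addressbook>
<address>
<name>Jane Smith</name>
<street>15 Pine St</street>
<phone type="work">555-1234</phone>
</address>
<address>
<name>John Val&#233;ry</name>
<street>123 Any Street</street>
</address>
</addressbook>""")

# Both records are well formed, but only the first one satisfies the rule.
problems = {}
for address in book.findall("address"):
    missing = missing_children(address)
    if missing:
        problems[address.findtext("name")] = missing

print(problems)
```

A validating parser performs this kind of check for you, driven by the declarations in the DTD instead of by application code.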
Why Bother with Validity?
There are a great many applications that can be satisfied with nothing more rigorous than well-formedness. For these applications, it is sufficient to ignore data that wasn’t expected or to process it in some default way.
But there are also a lot of applications that need more control. If you wouldn’t expect to find an expense report in the middle of your purchase order, or an HTML <div> in the middle of your MathML formula, you had better check to make sure that the author didn’t put one there. Validation is an easy way to enforce the expected constraints while the document is being parsed.
The XML Recommendation describes just one kind of validity, based on XML DTDs. And while DTDs have a lot of power, they aren’t always the best tool for the job. As a result, there are now several other languages that can be used (instead of DTDs) to specify the validity constraints for a given XML document. These languages are often referred to as schema languages. Some of the more popular schema languages are W3C XML Schema, RELAX, TREX, and Schematron. You’ll find references for these schema languages in the Resources sidebar.
Namespaces
Many people are excited by the prospect of mixing “XML vocabularies” together. A vocabulary, in this case, means a specific set of XML elements and attributes; HTML is a vocabulary, as are DocBook, MathML, SVG (Scalable Vector Graphics), and every other defined set of tag names that you have seen. Historically, vocabularies were defined only by DTDs, but today you see vocabularies defined by other schema languages as well.
Mixing vocabularies together allows you to apply things you already know to solve new problems. For example, a purchase order document might allow HTML elements in the description of an item, or an SVG document might allow a MathML equation as part of its content.
This raises an important question: how can elements from different vocabularies be distinguished? For example, both HTML and SVG define a <title> element, but they are quite different. If you are mixing SVG graphics into a DocBook document, how do you distinguish between them?
A namespace gives elements and attributes globally unique names. It does this by associating names with Uniform Resource Identifiers (URIs, see http://www.w3.org/Addressing for details). URIs are long and tedious to type, and they include characters that aren’t legal in XML names, so the Namespaces Recommendation defines a shortcut.
Using an attribute-like syntax, the declaration xmlns associates a namespace prefix with a URI:
<e:doc xmlns:e="http://example.org/xmlns/example">
<e:para>A paragraph.</e:para>
</e:doc>
The prefix e: is bound to the specified URI on the <doc> element and all of its children. Logically, what this says is, “this occurrence of the <doc> element is the one defined by the namespace http://example.org/xmlns/example.” Similarly, the <para> element is the one defined by that namespace name as well.
It is important to note that the prefix is irrelevant. This document is exactly the same as the preceding one:
<x:doc xmlns:x="http://example.org/xmlns/example">
<x:para>A paragraph.</x:para>
</x:doc>
If you’re going to be using predominantly one namespace, you can make that the default by declaring it with xmlns="…". This next document is logically identical to the preceding two:
<doc xmlns="http://example.org/xmlns/example">
<para>A paragraph.</para>
</doc>
Note, however, that the following document is completely different from the ones above:
<doc>
<para>A paragraph.</para>
</doc>
Because it lacks a namespace declaration, its elements are different from those of all other namespaces.
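A namespace-aware parser makes the equivalence concrete. Python’s xml.etree.ElementTree reports each element name as {namespace-URI}local-name, so the prefix drops out of the picture entirely. The variable names and the x: prefix below are chosen for this sketch only.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/xmlns/example"

with_e = f'<e:doc xmlns:e="{NS}"><e:para>A paragraph.</e:para></e:doc>'
with_x = f'<x:doc xmlns:x="{NS}"><x:para>A paragraph.</x:para></x:doc>'
default = f'<doc xmlns="{NS}"><para>A paragraph.</para></doc>'
bare = "<doc><para>A paragraph.</para></doc>"


def element_names(text):
    """Return the fully qualified name of every element in the document."""
    return [element.tag for element in ET.fromstring(text).iter()]


# The three namespaced documents produce identical element names.
print(element_names(with_e))
print(element_names(with_e) == element_names(with_x) == element_names(default))

# The document with no declaration uses plain, unqualified names.
print(element_names(bare))
```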
So if you want to mix two vocabularies, you can do it with multiple namespaces as shown in Listing Seven.
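Listing Seven: Mixing HTML and SVG with Namespaces
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg">
<head>
<title>An HTML title</title>
</head>
<body>
<svg:svg>
<svg:title>An SVG title</svg:title>
</svg:svg>
</body>
</html>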
The default namespace is HTML and the svg: namespace is SVG, so the two title elements are entirely distinct.
You can change namespaces on the fly, so the example in Listing Seven can even be written like Listing Eight.
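Listing Eight: Redeclaring the Default Namespace
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>An HTML title</title>
</head>
<body>
<svg xmlns="http://www.w3.org/2000/svg">
<title>An SVG title</title>
</svg>
</body>
</html>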
In Listing Eight, the two title elements are still distinct. Because the svg element redefines the default namespace, the namespace name of the SVG title element is http://www.w3.org/2000/svg, even though it has no explicit prefix. When the svg element ends, the previous declaration goes back into effect.
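An application can tell the two title elements apart by asking for them by namespace. The sketch below uses Python’s xml.etree.ElementTree on a small made-up page that redeclares the default namespace on its svg element; the h and s prefixes exist only in the query and need not match anything in the document.

```python
import xml.etree.ElementTree as ET

page = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Page title</title></head>
<body>
<svg xmlns="http://www.w3.org/2000/svg">
<title>Drawing title</title>
</svg>
</body>
</html>"""

# Query-side prefixes, mapped to the two namespace names.
ns = {"h": "http://www.w3.org/1999/xhtml",
      "s": "http://www.w3.org/2000/svg"}

root = ET.fromstring(page)
html_titles = [title.text for title in root.iterfind(".//h:title", ns)]
svg_titles = [title.text for title in root.iterfind(".//s:title", ns)]

print(html_titles, svg_titles)
```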
XML is the best way to store structured information. It is an open, accessible technology that is being actively developed and widely deployed. If you’re interested in writing your own XML-aware applications, there are lots of XML libraries out there; there’s probably one available for your favorite language.
We’ve just barely scratched the surface of XML. Keep an eye out in future issues for an article about XML programming using Java, Perl, and other languages.
In the meantime, whether you are a hardcore open source developer or just beginning to explore the Linux waters, this article has hopefully given you a basic understanding of what XML is made of.
Norman Walsh, XML Standards Engineer in Sun’s Technology Development Group, is chair of the OASIS DocBook Technical Committee. He can be reached at norman.walsh@sun.com.