XML Schema Languages

Validating XML documents using schemas will reduce your stress level. We explain what validation is and take a look at four different ways of doing it.

It’s late in the afternoon (or the wee hours of the morning, depending on your coding preference) and you want to stop for the day. But you can’t. The XML application you’re writing has developed a mysterious bug. All the regression tests pass, but when you run the application and have it try to process your boss’s XML document, the whole thing crashes. You’ve read the code a dozen times; you know it works. What’s wrong?

The odds are good that your boss’s document contains invalid XML data, or is using XML elements and/or attributes in a way that you never anticipated. If you were using an XML schema when reading in the source document, you would have found this bug hours ago. Instead of crashing, your application would have notified you that it was given invalid XML data.

In past articles we’ve examined the fundamentals of XML and taken an introductory look at programming with it. (See http://www.linux-mag.com/2001-07/xml_basics_01.html and http://www.linux-mag.com/2001-10/xmldom_01.html/.) In this article, we’ll turn our attention to XML Schema languages, which are used to define an XML schema for an application to use.

There are several alternative schema languages competing for mind share at the moment. In this article, we’ll examine the four most common languages and look briefly at their features. We’ll also look at how you can use them in your applications today.

What Is a Schema?

A schema is nothing more than a set of constraints. These constraints can apply to both the content and attributes of an XML element (comments and processing instructions are ignored and inaccessible to most schema languages).

Note that sometimes the term “schema” is used to mean an abstract set of constraints and sometimes it is used to mean a particular document that contains a schema in some specific language. This distinction is not often important, but it can lead to some confusion if you’re considering several alternative schema languages.

Throughout this article, we’ll examine four schema languages: XML 1.0 DTDs, W3C XML Schema, RELAX NG, and Schematron. There are other languages, but these four are by far the most common.

The most important benefit of a schema is that it allows you to test the validity of an XML document. An XML document is said to be valid with respect to a particular schema if it violates none of the constraints of that schema.

What Is Validation?

Validation is the process that answers the question: Does this particular document conform to this specific schema? The principal result of validation is an answer, either “yes” or “no.” As we’ll see later on, validation may also provide useful side-effects for an application.

The specific tool for, or means of, performing validation on a document varies depending on the schema language that you are using.

You should validate your XML documents because this will prevent errors and simplify subsequent processing. XML is structured data. Whenever you write an application or stylesheet that processes an XML document, it’s almost always going to be designed to handle only a certain class of documents.

For example, if you write an application that is designed to print mailing labels from XML addresses, it probably isn’t going to do anything useful if you hand it an SVG (Scalable Vector Graphic) diagram or the text of Shakespeare’s Romeo and Juliet.

If you were reading the addresses from an unstructured or binary format, you’d have to build error-checking into your application. But with XML, you can use validation to greatly simplify your task.

By passing the input document through a validator, you can determine beforehand if it’s going to have the structure you expect (and reject it outright if it doesn’t). By the same token, you can write your applications to accept all of the possible valid structures because they are identified by the schema.

As soon as you start working with XML vocabularies that are a little bit unfamiliar, or vocabularies that have more elements than you can remember off the top of your head, you’re bound to run into some trouble if you don’t validate.

Even for small documents, there’s no substitute for validation. It can be extremely frustrating to stare at a document a dozen times or more before realizing that you’ve used <lastname> instead of <surname> and that you’d not turned on validation. Validation would have revealed the error in seconds.

What Can You Validate?

Validation can check two different kinds of constraints — constraints on simple types and those on complex types. The specific nature of the constraints that a given schema language can express is the principal feature that distinguishes the languages from one another.

Simple Types: Simple types are atomic strings that occur in a document. They are often attribute values, but some schema languages allow you to define element content in terms of simple types. Strings, numbers, positive integers, and dates are all examples of simple types.

In order for a string to satisfy the constraints of a simple type, it must first be a simple string of characters (meaning that it can’t contain any elements). Secondly, it must be in the “lexical space” of the type that it is being tested against. This is just a fancy way of saying that it must “look like” a value of the right type. So, this means that an integer must contain only decimal digits, possibly with a leading sign character; if it’s a date, it must look like an ISO 8601 date, and so on.
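To make the idea of a lexical space concrete, here is a minimal Java sketch of the two checks just described. It is only an illustration and not how any particular schema processor is implemented; the class and method names are our own invention, and the date check assumes plain ISO 8601 calendar dates of the form YYYY-MM-DD.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical helper: tests whether a string "looks like" a value of a simple type.
class LexicalSpaceCheck {
    // An optional leading sign followed by one or more decimal digits.
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");

    static boolean looksLikeInteger(String value) {
        return INTEGER.matcher(value).matches();
    }

    // Accepts ISO 8601 calendar dates such as 2001-12-31 (assumption: no time zone part).
    static boolean looksLikeDate(String value) {
        try {
            LocalDate.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInteger("-42"));
        System.out.println(looksLikeInteger("4 2"));
        System.out.println(looksLikeDate("2001-12-31"));
        System.out.println(looksLikeDate("12/31/2001"));
    }
}
```

A real schema processor does the same kind of test for every declared simple type, which is exactly the work you would otherwise scatter through your application.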

Complex Types: Complex types can be a little harder to understand. A complex type is an arrangement of elements, attributes, and possibly text. Complex types represent things like postal addresses, chapters of a book, and purchase orders (anything with sub-element structure).

Types of Constraints

There are two kinds of constraints that you can place on an element: you can limit the number and names of its attributes, and you can constrain if, where, and how many times it occurs with respect to other elements.

An element satisfies the constraints on its attributes if it satisfies the following conditions:

  1. It has all of its required attributes.

  2. It has no attributes that aren’t allowed by the schema. Since attributes must have unique names and cannot be repeated on a single element, this is a fairly easy constraint to check.
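The two conditions above amount to a pair of set comparisons. The following Java sketch shows the idea under the assumption that we already have the attribute names as sets; the class name and the sample attribute names are hypothetical and chosen only for illustration.

```java
import java.util.Set;

// Hypothetical helper: checks attribute names against a schema's required and allowed sets.
class AttributeConstraintCheck {
    // present:  attribute names found on the element
    // required: names the schema demands
    // allowed:  every name the schema permits (assumed to include the required ones)
    static boolean satisfies(Set<String> present, Set<String> required, Set<String> allowed) {
        boolean hasAllRequired = present.containsAll(required);
        boolean hasNothingExtra = allowed.containsAll(present);
        return hasAllRequired && hasNothingExtra;
    }

    public static void main(String[] args) {
        Set<String> required = Set.of("country");
        Set<String> allowed = Set.of("country", "id");
        System.out.println(satisfies(Set.of("country", "id"), required, allowed));
        System.out.println(satisfies(Set.of("id"), required, allowed));
        System.out.println(satisfies(Set.of("country", "color"), required, allowed));
    }
}
```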

Constraining where an element occurs is the role of the content model. Every element that allows sub-element structure has a content model associated with it. This content model describes what elements may occur directly inside it and in what order they may occur.

A complex type can therefore be seen as the combination of a content model and a set of attribute constraints.

For example, a U.S. postal address might be constrained to contain only street, city, state, and ZIP elements, or a chapter in a book might be constrained to contain exactly one title followed by paragraphs.

Understanding Content Models

Content models are a little bit like regular expressions, although they are nowhere near as flexible. In constructing a content model, you have four tools at your disposal: sequences, choices, repetition, and grouping.

Sequences: A sequence specifies that several elements must occur one after another. For example, a U.S. postal address must contain a city, followed by a state, followed by a ZIP code.

Choices: A choice specifies that any one of a number of elements can occur. For example, a chapter might be allowed to consist of paragraphs, tables, and figures.

Repetition: Repetition allows you to express that an element may be repeated. For example, an address might allow one or more street elements. All of the schema languages allow you to specify that an element may appear exactly once, at most once (optionality), one or more times, and zero or more times. Some languages provide even finer control.

Grouping: Grouping allows you to associate repetition with an entire sequence or choice. For example, a chapter might allow the choice of paragraph, table, or figure to be repeated any number of times. This effectively allows any combination of paragraphs, tables, and figures to occur.
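Because content models resemble regular expressions, one way to get a feel for them is to write one as a regular expression over the names of an element's children. The Java sketch below does that for the postal address used in this article (an optional name, then a post office box or one to three street lines, then city, state, and ZIP). It is only an analogy: real validators do not work on strings of names, and the class name is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: a content model written as a regex over child element names.
class ContentModelRegex {
    // Sequence:   items written one after another
    // Choice:     the | operator
    // Repetition: ? and {1,3}
    // Grouping:   parentheses
    private static final Pattern ADDRESS =
        Pattern.compile("(name )?(pobox |(street ){1,3})city state zip ");

    static boolean matches(List<String> childNames) {
        StringBuilder names = new StringBuilder();
        for (String name : childNames) {
            names.append(name).append(' ');   // each name is followed by one space
        }
        return ADDRESS.matcher(names).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(List.of("name", "street", "city", "state", "zip")));
        System.out.println(matches(List.of("pobox", "street", "city", "state", "zip")));
    }
}
```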


In addition to returning a Boolean result for the question “Is this document valid?”, schema processors may provide additional information for an application. In the case of W3C XML Schema, a sophisticated Post-Schema-Validation Infoset (PSVI) is described, while other languages define the results more simply (or not at all).

There are two ways of looking at these side effects:

On the one hand, they allow the schema processor to provide additional, useful information to applications. This means that applications using schema-validated documents don’t have to recalculate values that the schema processor must have calculated in order to check the validity of the document.

Suppose, for example, that your schema identifies the type of an attribute value as a duration of less than 100 years. In order for the schema processor to determine if your document is valid, it will have to:

  1. Examine the string value of the attribute,

  2. Attempt to convert it into a duration,

  3. Check that the duration is within the allowed range.

That’s quite a bit of work, and the schema validator has to do it to determine validity, regardless of whatever subsequent processing might be performed.

Providing a mechanism for the validator to pass the typed information on to the application frees the application author from the burden of reprocessing the attribute string in order to extract the duration for its purposes. This improves efficiency and avoids one class of application errors.
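The three steps above can be sketched in a few lines of Java. This is an illustration of the principle rather than of any real validator: the class name is hypothetical, java.time.Period stands in for the schema's duration type, and only the years and months of the value are compared against the limit. The point is that the method hands back the typed value it had to compute anyway, so the caller never needs to parse the string a second time.

```java
import java.time.Period;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical illustration: validation that returns the typed value as a side effect.
class DurationSideEffect {
    private static final long LIMIT_IN_MONTHS = 100 * 12;   // 100 years

    // Returns the parsed duration if the string is valid, or empty if it is not.
    static Optional<Period> validate(String attributeValue) {
        final Period duration;
        try {
            // Steps 1 and 2: examine the string and convert it into a duration.
            duration = Period.parse(attributeValue);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
        // Step 3: check the range. Simplification: the day field is ignored.
        boolean inRange = duration.toTotalMonths() < LIMIT_IN_MONTHS;
        return inRange ? Optional.of(duration) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validate("P42Y"));
        System.out.println(validate("P100Y"));
        System.out.println(validate("forty-two years"));
    }
}
```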

On the other hand, these side effects perform a subtle sort of transformation. They can make the same document appear different depending on whether or not schema processing was performed.

In the simple case of DTD validation, for example, a non-validating parser may not provide default attribute values that were specified in the DTD schema. This means that an application such as a stylesheet may behave quite differently depending on exactly what kind of parser was used.

Because developers are often not aware that they are actually able to make choices about the kind of processors used, these side-effects can sometimes result in rather confusing errors.

Type Information

As we just described, one of the most useful features of a schema validator is that it provides datatype information. When looking at a validated document, your application might expect that an attribute declared as an integer would be identified specifically as an integer.

There are existing applications, such as XSLT processors, that rely on this extra information for some of their features. XSLT stands for “XSL Transformations,” where XSL itself stands for “Extensible Stylesheet Language.” XSLT is the standard method for transforming XML into HTML or plain text (we’ll be looking at XSLT processors in a future article).

In order for the XSLT id() function to find ID values, the document must have been validated with a parser that provides information about which attribute values are in fact of type “ID.”
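The same dependency on type information shows up outside XSLT. As a rough analogy, the Java sketch below uses the DOM method getElementById(), which likewise only finds attributes that the parser knows to be of type ID. We assume the XML parser bundled with a current JDK here, not the toolkits discussed later in this article; the class name and the sample documents are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration: ID lookup depends on the parser having seen the declaration.
class IdLookupSketch {
    static final String WITH_DTD =
          "<!DOCTYPE addresses [\n"
        + "  <!ELEMENT addresses (address*)>\n"
        + "  <!ELEMENT address (#PCDATA)>\n"
        + "  <!ATTLIST address id ID #IMPLIED>\n"
        + "]>\n"
        + "<addresses><address id='a1'>123 Any Street</address></addresses>";

    // The same content, but nothing tells the parser that "id" is of type ID.
    static final String WITHOUT_DTD =
        "<addresses><address id='a1'>123 Any Street</address></addresses>";

    static Element findById(String xml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementById(id);   // null when no ID-typed attribute matches
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findById(WITH_DTD, "a1") != null);
        System.out.println(findById(WITHOUT_DTD, "a1") != null);
    }
}
```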

Default Values

Another side effect that validators can provide is the ability to set default values for attribute or element content. For example, if an address does not specify a country, the country “US” can be provided automatically. The value of a default attribute is literally provided by the validator. From the point of view of any subsequent processing, the value will be present, and it may not even be possible to tell whether it was included within the original document or provided by the validator.
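Here is a small Java sketch of attribute defaulting in action, assuming the XML parser bundled with a current JDK and a DTD carried in the document's internal subset. The class name and the sample document are hypothetical. The document never mentions a country, yet the application sees one. Some APIs do expose the difference: in the DOM, Attr.getSpecified() is defined to report whether a value was actually written in the document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration: a default attribute value supplied by the parser.
class DefaultAttributeSketch {
    static final String XML =
          "<!DOCTYPE address [\n"
        + "  <!ELEMENT address (#PCDATA)>\n"
        + "  <!ATTLIST address country CDATA \"US\">\n"
        + "]>\n"
        + "<address>123 Any Street</address>";   // note: no country attribute here

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element address = parse(XML);
        // The value declared as the default in the DTD.
        System.out.println(address.getAttribute("country"));
        // Reports whether the value was explicitly given in the document.
        System.out.println(address.getAttributeNode("country").getSpecified());
    }
}
```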

Some validators also provide the ability to specify “fixed values” for attributes or element content. A fixed value is like a default in that it will be provided if it is absent, but if it is present, it must match the fixed value or values specified in the schema.

It must be noted that supplying default values is not, strictly speaking, part of validation. In fact, in XML 1.0, even non-validating parsers are required to provide default attribute values if they encounter declarations for them. Some schema languages, such as RELAX NG, go so far as to specify defaulting behavior in a separate specification, entirely apart from validation.

Programming with Schemas

Unfortunately, a comprehensive explanation of any one of the common schema languages would require far more space than is available in a single magazine article. So instead, we’ll look at each of them briefly and then attempt to point out some of their distinguishing features. Be sure to take a look at Resources (pg. 33) for more specific information, particularly the report of the Schema Languages Comparison Town Hall Meeting from the XML 2001 conference.

Continuing with the postal address example, let’s look at a (simple) schema for validating US postal addresses. Figure One shows what an acceptable element might look like.

Figure One: Validating U.S. Postal Addresses

<name>John Smith</name>
<street>123 Any Street</street>

A postal address might be described this way: an optional name, a post office box or up to three lines of street address, a city, state, and a ZIP code. Because this format will only work for postal addresses located in the United States, we’ll add a fixed attribute to the address that will indicate this. Finally, we’ll allow an optional ID attribute on addresses too, so that we can give addresses unique identifiers for locating them easily in our documents.

Choosing a schema language and writing the schema is most of the battle, but before we’re finished, we have to actually use the schema to perform validation. There are lots of ways this can be accomplished, and exactly which mechanism you choose will depend on your needs and your programming environment.

There are standard XML parsing modules for most common languages, including C/C++, Perl, Python, and Tcl. The examples we’ll explore here are in Java using toolkits from Apache and Sun Microsystems. In the interest of space, we’ll only present the definition of the address elements and small fragments of the source code for our validation examples. You can get the complete text of each of the schemas and the full source code online at http://www.linux-mag.com/downloads/xml_schema.tar.gz.

DTD Validation

Historically, DTDs (Document Type Definitions) were inherited from SGML, and they are the only form of schema described by the XML 1.0 Recommendation. The principal advantage of DTDs is that they are supported by every validating XML 1.0 parser. Also, they have well understood and agreed upon semantics, and they are compact. Unfortunately, for many modern applications, their advantages are outweighed by their disadvantages, some of which are listed below:

  • A very small selection of simple datatypes and no mechanism for adding user-defined datatypes,

  • A customization and extension mechanism based entirely on macro-like string substitution,

  • Extremely limited support for XML Namespaces,

  • A different notation — DTDs are not written in element and attribute syntax like XML documents.

The declaration for an address in a DTD appears in Figure Two.

Figure Two: An Address in XML 1.0 DTD

<!ELEMENT address (name?, (pobox|street+), city, state, zip)>
<!ATTLIST address
  id      ID    #IMPLIED
  country CDATA #FIXED "US">

DTDs use a simple string-based syntax to express content models. The comma separator is used to identify a sequence, vertical bars to identify a choice. Grouping is provided by parentheses. The postfix operators “?”, “+”, and “*” identify repetition. They indicate optionality, one or more times, and zero or more times, respectively. A name (or group) with no repetition operator must appear exactly once.

DTDs provide a few mechanisms for constraining simple types in attribute values, but they provide no mechanisms for such constraints in element content. There is no way to express that the content of the state element must be a valid U.S. state abbreviation or that the ZIP code must contain an integer, let alone an integer in a specific range. (Using integers to constrain the ZIP code is a bit of a hack but it provides a simple example; in real life, a regular expression or some other mechanism would be preferable.)

Attributes are declared separately. They are composed of a name, a type (there are only a handful of types in XML DTDs), and either a default value or a keyword that indicates if they are optional (#IMPLIED), required (#REQUIRED), or fixed (#FIXED).

Validating with DTDs is easy. Any validating XML 1.0 parser will, by definition, perform DTD validation. (For more information on parsing XML, please refer to our October 2001 article located at http://www.linux-mag.com/2001-10/xmldom_01.html or the Resources sidebar.)

Figure Three illustrates an example of constructing a validating SAX parser using JAXP.

Figure Three: Validating with XML 1.0 DTDs

SAXParserFactory factory = null;
SAXParser parser = null;

factory = SAXParserFactory.newInstance();
factory.setValidating(true); // Enable validation

parser = factory.newSAXParser();

Any documents that are parsed with this parser will be validated according to the DTD they specify. One unique feature of DTDs is that they must be referenced by the document to be validated. The document type declaration (the line that begins <!DOCTYPE …) at the beginning of an XML document identifies the DTD that applies to that document. If the document type declaration is missing, there is no standard mechanism for selecting an alternate DTD to use for validation.
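For readers who would like something they can run, here is a self-contained sketch that expands on Figure Three. It is not the article's downloadable source: the class name, the sample addresses, and the choice to put the DTD in the document's internal subset are all our own assumptions, made so the example needs no external files. Validity errors are delivered to the handler's error() method, so we collect them in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: DTD validation with JAXP, collecting validity errors.
class DtdValidationSketch {
    // The address DTD, supplied through the document type declaration.
    static final String DOCTYPE =
          "<!DOCTYPE address [\n"
        + "  <!ELEMENT address (name?, (pobox|street+), city, state, zip)>\n"
        + "  <!ATTLIST address country CDATA #FIXED \"US\">\n"
        + "  <!ELEMENT name (#PCDATA)>   <!ELEMENT pobox (#PCDATA)>\n"
        + "  <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)>\n"
        + "  <!ELEMENT state (#PCDATA)>  <!ELEMENT zip (#PCDATA)>\n"
        + "]>\n";

    // Returns the validity errors reported for the given address element.
    static List<String> validate(String addressXml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);   // Enable validation
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(DOCTYPE + addressXml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage());   // a validity error, not a crash
                    }
                });
        } catch (Exception e) {
            errors.add("fatal: " + e.getMessage());   // not well-formed, I/O trouble, etc.
        }
        return errors;
    }

    public static void main(String[] args) {
        String good = "<address><name>John Smith</name><street>123 Any Street</street>"
                    + "<city>Anytown</city><state>MA</state><zip>01234</zip></address>";
        String bad  = "<address><name>John Smith</name><street>123 Any Street</street>"
                    + "<state>MA</state><zip>01234</zip></address>";   // city is missing
        System.out.println(validate(good));
        System.out.println(validate(bad));
    }
}
```

Without an error handler, many parsers simply print validity errors or ignore them, so installing one is usually the first thing to do after turning validation on.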

W3C XML Schema Validation

The next generation of schema validation from the W3C is XML Schemas. Unlike DTDs, XML Schemas are written using XML elements and attributes instead of a special notation. XML Schemas offer a rich library of built-in datatypes and a type hierarchy that separates type definition from element declaration and allows element types to be derived by restriction or extension. In other words, you can derive an international address type by extending the definition of some base address type, or you can derive an age datatype from integer by restricting it to values strictly greater than 0 and less than 100.

The type definition for an address in a W3C XML Schema appears in Figure Four.

Figure Four: An Address in W3C Schema

<xs:element name='address' type='Address'/>

<xs:complexType name='Address'>
  <xs:sequence>
    <xs:element ref='name' minOccurs='0'/>
    <xs:choice>
      <xs:element ref='poBox'/>
      <xs:element ref='street' minOccurs='1' maxOccurs='3'/>
    </xs:choice>
    <xs:element name='city' type='xs:string'/>
    <xs:element name='state' type='StateAbbrev'/>
    <xs:element name='zip' type='ZipCode'/>
  </xs:sequence>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="country" type="xs:string" fixed="US"/>
</xs:complexType>

W3C XML Schemas allow for the separation of types from element declarations. In this example, we show the address type (the xs:complexType) and the element declaration (the xs:element) that associates the address element with this type. Separation of type and declaration allows multiple elements to be easily constructed from the same type.

The content model is expressed using the elements xs:sequence and xs:choice; these elements provide grouping automatically. Repetition is handled by the minOccurs and maxOccurs attributes. Note that W3C XML Schemas allow arbitrary repetition counts, so a limit such as “one to three street elements” can be stated directly.

Many of the elements and attributes here are declared in terms of built-in datatypes, but state and zip are defined in terms of user-defined types. This allows the schema to constrain their values to the U.S. states and reasonable ZIP codes. (The definitions of the user-defined types StateAbbrev and ZipCode are not shown here, but they are fully defined in the address schema that is available for download).

The xs:attribute element declares attributes. They have a name, a type (which may be drawn from the full palette of built-in XML Schema datatypes), and may be specified as being either optional, required, or fixed.

Support for W3C XML Schemas is starting to show up in parser libraries. Recent releases of the popular Xerces parser from Apache, for example, now include W3C XML Schema validation.

The example in Figure Five is based on the XercesJ 1.4.3 APIs. (For more information on Xerces, check out the Apache Project’s XML pages. You’ll find the URL in Resources.)

In Figure Five, we begin by establishing a SAX Parser factory, just as we did for DTD validation. Note, however, that we explicitly enable namespace awareness; this is necessary for W3C XML Schema validation with Xerces. Because Xerces enables schema validation through features on the XML Reader implementation, we must set these features after getting the XML Reader.

Figure Five: Validating with W3C XML Schemas

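SAXParserFactory factory = null;
SAXParser parser = null;

factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true); // Enable Namespace awareness
factory.setValidating(true); // Enable validation

parser = factory.newSAXParser();

XMLReader reader = parser.getXMLReader(); // Get the reader

// Turn on W3C XML Schema validation and full schema checking
reader.setFeature("http://apache.org/xml/features/validation/schema", true);
reader.setFeature("http://apache.org/xml/features/validation/schema-full-checking", true);

if (schema != null) { // Possibly use an alternate schema
    System.out.println("Validating with " + schema);
    // Assumed call: point Xerces at the "no namespace" schema
    reader.setProperty("http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation", schema);
}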

Normally, the document will use the schema location hints to identify which schema should be used. However, W3C XML Schemas allow us to force the document to be processed with an arbitrary schema regardless of any hints that might be in the document.

The “no namespace” schema is selected because this address won’t be located in any namespace. If you’re using namespaces, you want to make sure that you set the http://apache.org/xml/properties/schema/external-schemaLocation property or properties (if you have more than one namespace) appropriately.

For more information on XML Namespaces, please refer again to our July 2001 article on XML, located at http://www.linux-mag.com/2001-07/xml_basics_01.html/.
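If you don’t have that Xerces release at hand, the same kind of check can be sketched with the javax.xml.validation API that ships with current JDKs. Note that this is a different API from the Xerces features described above. The class name is hypothetical, and the schema is a cut-down version of the address type: the user-defined StateAbbrev and ZipCode types are replaced by xs:string so that the example needs no external files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical illustration: W3C XML Schema validation with javax.xml.validation.
class XsdValidationSketch {
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='address'>"
        + " <xs:complexType>"
        + "  <xs:sequence>"
        + "   <xs:element name='name' type='xs:string' minOccurs='0'/>"
        + "   <xs:choice>"
        + "    <xs:element name='poBox' type='xs:string'/>"
        + "    <xs:element name='street' type='xs:string' maxOccurs='3'/>"
        + "   </xs:choice>"
        + "   <xs:element name='city' type='xs:string'/>"
        + "   <xs:element name='state' type='xs:string'/>"
        + "   <xs:element name='zip' type='xs:string'/>"
        + "  </xs:sequence>"
        + "  <xs:attribute name='country' type='xs:string' fixed='US'/>"
        + " </xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            // validate() throws a SAXException when the document is not valid.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<address><street>123 Any Street</street>"
            + "<city>Anytown</city><state>MA</state><zip>01234</zip></address>"));
        System.out.println(isValid("<address country='CA'><street>123 Any Street</street>"
            + "<city>Anytown</city><state>MA</state><zip>01234</zip></address>"));
    }
}
```

Here the schema is chosen by the application rather than by a hint in the document, which mirrors the “alternate schema” option discussed above.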

OASIS RELAX NG Schema Validation

The RELAX NG specification is the work of the OASIS RELAX NG Technical Committee. OASIS, the Organization for the Advancement of Structured Information Standards, is an international, not-for-profit consortium that designs and develops industry standard specifications for interoperability based on XML.

This work represents the unification of two other schema languages, TREX and RELAX. RELAX NG is built on a strong theoretical foundation and provides a number of features not available elsewhere, including co-constraints and the ability to use not only the W3C XML Schema datatypes, but also alternate datatype libraries. RELAX NG offers little support for validation side-effects. In fact, ID identification is provided by an ancillary specification; features such as fixed attribute values and fixed or defaulted element content are not provided.

The declarations for an address in a RELAX NG schema appear in Figure Six.

Figure Six: An Address in RELAX NG

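<element name="address">
  <optional>
    <attribute name="country" a:defaultValue="US">
      <choice><value>US</value></choice>
    </attribute>
  </optional>
  <optional>
    <attribute name="id"><data type="ID"/></attribute>
  </optional>
  <optional><ref name="Name"/></optional>
  <choice>
    <ref name="POBox"/>
    <oneOrMore><ref name="Street"/></oneOrMore>
  </choice>
  <element name="city"><text/></element>
  <element name="state"><ref name="StateAbbrev"/></element>
  <element name="zip"><ref name="ZipCode"/></element>
</element>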

Within the model for RELAX NG Schemas, elements form a sequence unless they are contained in a choice. A group element allows for grouping. Additionally, there are elements that allow the schema to express repetition: oneOrMore, optional, and zeroOrMore. Elements can be declared “in place” by using element. Alternatively, you can use ref to refer to a named pattern that is defined elsewhere in the schema.

RELAX NG Schemas can define components that may be reused. Here, StateAbbrev and ZipCode components constrain the content of state and zip, respectively.

Attribute declarations and element declarations in RELAX NG are treated uniformly. The same elements used to declare element content are used to declare attribute content and optionality. The ability to specify default values for attributes is provided by a separate annotation specification, shown in use here with the a:defaultValue annotation.

Because there is no direct equivalent of fixed values in RELAX NG, this schema achieves that behavior by specifying that the allowed content of the country attribute is a choice with a single possible value.

Figure Seven shows an example of validating with RELAX NG. In this case, Sun’s Multi-Schema XML Validator (MSV) is being used. MSV has a slightly different factory setup than the SAX Parser.

Figure Seven: Validating with RELAX NG

String schemaURI = "http://relaxng.org/ns/structure/0.9";
VerifierFactory factory = VerifierFactory.newInstance(schemaURI);

Verifier verifier = factory.newVerifier(schema);

if (verifier.verify(xmlfile)) {
    System.out.println("Document is valid, full speed ahead!");
} else {
    System.out.println("Document is not valid according to " + schema);
}

As its name implies, MSV can validate using several different schema languages. The schemaURI is used to determine which type of verifier to construct. In the full source for this program, you’ll see that there’s a command-line switch to select either RELAX NG or W3C XML Schema validation.

After constructing a factory for the validator (or verifier) we want, we must then build a verifier for the particular schema that we’re using. The schema variable contains the URI of the file containing the schema we wish to use for validation.

Schematron Validation

Schematron, developed principally by Rick Jelliffe at the Academia Sinica Computing Center, takes an entirely different approach to validation. Unlike the other schema languages we have looked at, Schematron is not based on the declaration of a grammar that must be matched, nor does it have a content model. Instead, it relies on the validation of “tree patterns.” As a result, a Schematron system does not have the ability to provide validation side-effects.

This technique provides tremendous new power; you could, for example, design a Schematron schema that required every city name to be two words or have an odd number of letters. In short, you can assert that any expression be true in any given context.

Using Schematron is sometimes a bit awkward, but there are many constraints that can’t be expressed easily, or even at all, in a grammar-based schema. One area of growing interest is the combination of Schematron with other validation strategies, effectively combining the best of both worlds.

The declarations for an address in a Schematron schema are in Figure Eight.

Figure Eight: An Address in Schematron

<pattern name="Valid Address">
  <rule context="address">
    <assert test="count(name) &lt; 2">Must have at most one name</assert>
    <assert test="count(pobox) != 0 or count(street) != 0">Must have pobox or
      street</assert>
    <assert test="count(pobox) = 0 or count(street) = 0">Must have only one of
      pobox or street</assert>
    <assert test="count(pobox) &lt; 2">Must have at most one pobox</assert>
    <assert test="count(street) &lt;= 3">May have at most three lines of street
      address</assert>
    <assert test="count(city)=1">Must have exactly one city</assert>
    <assert test="count(state)=1">Must have exactly one state</assert>
    <assert test="count(zip)=1">Must have exactly one zip</assert>
    <assert test="not(@country) or @country='US'">Country must be US</assert>
  </rule>
</pattern>

In this schema, we express our constraints literally using XPath expressions. XPath, the XML Path Language, will be described in greater detail in our future article on XSL Transformations. In the example above, we test for an optional name element by asserting that the number of name elements is less than two (i.e., zero or one, since the number of elements that are present cannot possibly be negative). Attribute and element constraints are handled simply by using the appropriate expressions.

Note that these XPath expressions occur within the context of an XML document, so the markup characters that are significant to XML must be escaped. That’s why we use “&lt;” instead of “<” in our expressions.
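To see what these assertions mean without installing Schematron, you can evaluate the same XPath expressions yourself. The Java sketch below does so with the javax.xml.xpath API in current JDKs, using the address element as the context node. It is not a Schematron implementation; the class name is hypothetical and only a handful of the assertions are included. Because the expressions live in Java strings here rather than in an XML document, “<” does not need to be escaped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration: rule-based checking by evaluating XPath assertions.
class SchematronXPathSketch {
    // XPath test -> message reported when the test is false.
    static final Map<String, String> ASSERTIONS = new LinkedHashMap<>();
    static {
        ASSERTIONS.put("count(name) < 2", "Must have at most one name");
        ASSERTIONS.put("count(pobox) != 0 or count(street) != 0", "Must have pobox or street");
        ASSERTIONS.put("count(street) <= 3", "May have at most three lines of street address");
        ASSERTIONS.put("count(city)=1", "Must have exactly one city");
        ASSERTIONS.put("not(@country) or @country='US'", "Country must be US");
    }

    // Returns the messages of all assertions that fail for the given address element.
    static List<String> failures(String addressXml) {
        List<String> failed = new ArrayList<>();
        try {
            Element address = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(addressXml)))
                .getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> assertion : ASSERTIONS.entrySet()) {
                Boolean holds = (Boolean) xpath.evaluate(
                    assertion.getKey(), address, XPathConstants.BOOLEAN);
                if (!holds) {
                    failed.add(assertion.getValue());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failures(
            "<address country='CA'><city>Anytown</city></address>"));
    }
}
```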

Schematron is very different from the other schema languages. Instead of being a grammar-based language like the others, it’s a rule-based language. It turns out that XSLT, which we’ve mentioned before, is a convenient language for expressing these rules.

If you aren’t at all familiar with XSLT, that’s perfectly okay. You can use Schematron without a complete understanding of XSL Transformations; its validation tools can write all the XSLT stylesheets you’ll need.

The first step in performing Schematron validation is to convert your Schematron schema into an XSLT stylesheet (using an XSLT stylesheet provided in the Schematron implementation). Figure Nine shows a small part of the Schematron XSLT stylesheet for addresses. Only the tests for the first two assertions are shown.

Figure Nine: Validating with Schematron

<axsl:template match="address" priority="4000" mode="M1">
  <axsl:choose>
    <axsl:when test="count(name) &lt; 2"/>
    <axsl:otherwise>In pattern count(name) &lt; 2:
      Must have at most one name
    </axsl:otherwise>
  </axsl:choose>
  <axsl:choose>
    <axsl:when test="count(pobox) != 0 or count(street) != 0"/>
    <axsl:otherwise>In pattern count(pobox) != 0 or count(street) != 0:
      Must have pobox or street
    </axsl:otherwise>
  </axsl:choose>
</axsl:template>

The next step is to process your source document with the XSLT stylesheet derived from your Schematron schema. If there are assertions that fail, the stylesheet will produce messages to that effect.

Using a Schematron stylesheet directly in your application would then require processing the documents with an XSLT Processor (such as the Apache Project’s Xalan) and examining the results.
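As a rough sketch of that last step, the Java program below runs a stylesheet over an address and collects whatever messages it produces. We use the javax.xml.transform API and the XSLT processor bundled with current JDKs rather than a separately installed Xalan, and the stylesheet is a hand-written stand-in for generated Schematron output, covering only the first two assertions. The class name is hypothetical. An empty report means that no assertion failed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration: running a Schematron-style stylesheet and reading its report.
class SchematronXsltSketch {
    // Hand-written stand-in for the stylesheet a Schematron implementation would generate.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + " <xsl:output method='text'/>"
        + " <xsl:template match='address'>"
        + "  <xsl:choose>"
        + "   <xsl:when test='count(name) &lt; 2'/>"
        + "   <xsl:otherwise>Must have at most one name&#10;</xsl:otherwise>"
        + "  </xsl:choose>"
        + "  <xsl:choose>"
        + "   <xsl:when test='count(pobox) != 0 or count(street) != 0'/>"
        + "   <xsl:otherwise>Must have pobox or street&#10;</xsl:otherwise>"
        + "  </xsl:choose>"
        + " </xsl:template>"
        + "</xsl:stylesheet>";

    // Returns the text the stylesheet emits: one line per failed assertion.
    static String report(String addressXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(addressXml)),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(report(
            "<address><name>A</name><name>B</name><city>Anytown</city></address>"));
    }
}
```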

Choosing a Tool

In this article, we’ve examined the role and benefits of validation as well as a number of specific validation tools that you can use in your various XML projects. Which specific tool you elect to use will depend on many factors, but hopefully you’ve seen that validation of some form is always a good idea.


Norman Walsh actively participates in a number of standards efforts worldwide, including the XML Core. He can be reached at norman.walsh@sun.com.
