Javablog
by Java coders, for Java coders

Reading and Validating XML in Java with XPath

November 15th, 2008 by Sam

A year and a half ago, I posted Parsing XML. At the time, my focus was directed at parsing very large datasets of relatively simple XML formats… the Wikipedia Datadumps being a good example. Since then, I’ve found myself needing to parse smaller, well defined XML files that can be realistically loaded into memory. In this post, I will highlight all the XML parsing approaches available in Java and tell you why you should be using XPath with XSD validation.

The approaches available are:-

  • XML SAX, which applies the Hollywood Principle and can churn through enormous amounts of data. However, you have to manually keep track of where you are in the document tree, so it is not suitable for complicated formats. This is a standard part of J2SE.
  • XML DOM is a very low-level API for accessing XML documents like a tree, but the API is not at all suitable for end users. Like all the options below, DOM requires the XML to be loaded into memory. Think of this as being as close to the metal as you can get, without looking at the raw text. This is a standard part of J2SE.
  • XmlPull API is somewhere between SAX and DOM, but its main selling point is the J2ME implementation kXML2.
  • StAX (JSR 173), with the Sun implementation, is the “official” take on the pull-parsing approach of the XmlPull API. I’ve not seen a J2ME implementation.
  • Javolution has a suite of XML libraries including SAX, StAX and binding. Javolution is aimed at mobile and embedded programming. It’s not really geared toward enterprise or desktop development. Javolution is one of those projects that has always really impressed me, but it seems to be doing so many things that I always find something more specialist to use instead.
  • XStream which focuses entirely on the conversion of XML to and from POJOs.
  • JDOM was created because XML DOM is so horrible to work with. Admittedly, it is a more programmer friendly API… but it is not a part of standard J2SE and won’t be anytime soon. Given the innovation in Java’s XML support lately, I expect this project to slowly fade away.
  • NanoXML is a very lightweight parser, with a more programmer friendly API than DOM. I consider it to be like JDOM, interesting but having served its purpose. There are rumours of a J2ME port existing.
  • Pattern (i.e. picking XML apart with regular expressions)… seriously guys, that leads to the dark side, don’t even think about it.
  • JAXB, the Java Architecture for XML Binding. This project does for XML what JPA did for database access. It allows the programmer to define Java objects, convert them to and from XML, and have all the validation happen magically. I’ve not used this yet, but it seems very promising and I’ll write up a post when I work out how to use it. All I know for now is that it is a hefty API and NetBeans created a lot of boilerplate code when I did try to use it. Overkill for just reading XML, methinks. The production quality implementation is available on java.net, but is now also a part of J2SE.
  • XPath allows you to enter in simple path based queries to an XML document and get the contents out. It is quite fantastic and a pleasure to use. Combined with validation, you can assume that all your XML is exactly as it should be and forget about checking that elements exist or contain the right information. And there is a J2ME XPath implementation!
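To make the SAX complaint concrete, here is a minimal sketch of pulling the title out of a post with SAX. The class name SaxTitleReader and the stack-of-element-names trick are my own illustration, not anything the API gives you. The point is that the handler has to remember where it is in the tree by itself:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only
public class SaxTitleReader {

    /** Returns the text of the title element directly under the root element. */
    public static String title(String xml) {
        final StringBuilder title = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // SAX only reports events, so we track the current path ourselves
            private final Deque<String> path = new ArrayDeque<String>();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                path.push(localName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                path.pop();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // depth 2 means "child of the root element"
                if (path.size() == 2 && "title".equals(path.peek())) {
                    title.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be filled in
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return title.toString();
    }
}
```

That is a lot of bookkeeping for one element, and it only gets worse as the format grows.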

There are several specifications for XML validation, but I will only look at the W3C’s recommended approach, the W3C XML Schema (XSD) format.

The Validation Stage

XML Schema Definition files are incredibly powerful; if you use XML you should always invest the effort into writing one. Unfortunately, XSD has a very steep learning curve and there aren’t many good intro tutorials out there, but try the W3Schools tutorial to begin.

If you’ve defined your spec and validate any XML that you parse, you can rest assured that any formatting errors will only show up in the validation stage, not when you are reading the contents. That’s incredible… think about how fantastic that is for a moment. You never need to mix your parsing and validation code, ever again!

Let’s define a very simple XML format that we’ll use in this example, that looks something like this:-

<?xml version="1.0" encoding="UTF-8"?>
<javablog:post xmlns:javablog="javablog"
  dateTime="2008-10-19T20:00:00+00:00"
  url="http://javablog.co.uk/2008/10/19/mutable-entries-in-a-collection/">
  <title>Mutable entries in a Collection</title>
</javablog:post>

An XSD for this format might look something like

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="javablog" elementFormDefault="unqualified" targetNamespace="javablog">
    <xsd:element name="post" type="PostType"/>
    <xsd:complexType name="PostType">
        <xsd:all>
            <xsd:element name="title" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        </xsd:all>
        <xsd:attribute name="dateTime" type="xsd:dateTime" use="required"/>
        <xsd:attribute name="url" type="xsd:anyURI" use="required"/>
    </xsd:complexType>
</xsd:schema>

Note that the XSD ensures that the dateTime is a valid timestamp (in a format Java will parse for us), that url is a valid URI and that the title is always present. Be aware that xsd:anyURI is lenient and also accepts relative references, so the schema alone does not guarantee a protocol. Let’s write the Java code to validate input of this sort, and read it into a Document object.

Unfortunately, the Java XML packages make heavy use of factories and builders, and factories of builders, which are not thread safe and end up resulting in a lot of boilerplate in your code. But, sucking up the pain and getting on with life, the following should go into your constructors for any classes that want to have XML reading capability, and take in a Source object that defines the XSD:-

public MyClass(Source xsd) {
    Preconditions.checkNotNull(xsd);
    transFactory = TransformerFactory.newInstance();
    xPathFactory = XPathFactory.newInstance();
    docBuilderFactory = DocumentBuilderFactory.newInstance();
    docBuilderFactory.setNamespaceAware(true);
    try {
        dtFactory = DatatypeFactory.newInstance();
        SchemaFactory schemaFactory = SchemaFactory.newInstance(
            XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schema = schemaFactory.newSchema(xsd);
    } catch (SAXException e) {
        // it is so stupid that an invalid XSD is reported as a SAXException,
        // an implementation detail; keep the cause for debugging
        throw new IllegalArgumentException("xsd not valid: " + e.getMessage(), e);
    } catch (DatatypeConfigurationException e) {
        // see http://javablog.co.uk/2008/10/18/runtimeexceptions-and-gurus-failing-to-meditate/
        throw new GuruMeditationFailure(e);
    }
}

And yes, I know that is really, really ugly. One way to create a Source is with a StreamSource.
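As a minimal sketch of the StreamSource route, the following wraps either a file or some XML text held in memory. The class and method names (SourceExamples, fromFile, fromString) are made up for this example:

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only
public class SourceExamples {

    /** Wraps a file on disk; nothing is read until the Source is consumed. */
    public static Source fromFile(File file) {
        return new StreamSource(file);
    }

    /** Wraps XML (or XSD) text that is already in memory. */
    public static Source fromString(String xml) {
        return new StreamSource(new StringReader(xml));
    }
}
```

Keep in mind that a StreamSource backed by a Reader or InputStream can only be consumed once, so create a fresh one for every validation or parse.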

Then to perform validation on any Source against that XSD, it is as simple as

Validator validator = schema.newValidator();
try {
    validator.validate(source);
} catch (SAXException e) {
    // again, completely stupid that this is a SAXException (implementation detail)
    // instead of a more generic "XmlValidationException"
    // validation failed! put some failure logic here
} catch (IOException ex) {
    // shouldn't be possible with DOM
    throw new GuruMeditationFailure(ex);
}

which admittedly, is also unwieldy. But from this point onward you know that your Source has everything you defined in the XSD.
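Putting the schema and the validator together, here is a self-contained sketch that you can run as is. The class name ValidationSketch and the isValid helper are my own invention. I am also assuming the sample document declares the javablog prefix and that the schema leaves local elements unqualified, so that the plain title element validates:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only
public class ValidationSketch {

    // Assumption: local elements are unqualified, so <title> has no namespace
    public static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='javablog'"
        + " targetNamespace='javablog' elementFormDefault='unqualified'>"
        + "<xsd:element name='post' type='PostType'/>"
        + "<xsd:complexType name='PostType'>"
        + "<xsd:all><xsd:element name='title' type='xsd:string'/></xsd:all>"
        + "<xsd:attribute name='dateTime' type='xsd:dateTime' use='required'/>"
        + "<xsd:attribute name='url' type='xsd:anyURI' use='required'/>"
        + "</xsd:complexType></xsd:schema>";

    // Assumption: the javablog prefix is declared on the root element
    public static final String VALID_POST =
        "<javablog:post xmlns:javablog='javablog' dateTime='2008-10-19T20:00:00+00:00'"
        + " url='http://javablog.co.uk/2008/10/19/mutable-entries-in-a-collection/'>"
        + "<title>Mutable entries in a Collection</title></javablog:post>";

    /** Returns true if xml conforms to xsd, false if validation fails. */
    public static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            // a broken schema is a programming error, not a validation failure
            throw new IllegalArgumentException("XSD not valid: " + e.getMessage(), e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document broke a rule in the XSD
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory text
        }
    }
}
```

Swapping the dateTime for something like "yesterday", or dropping the title element, should make isValid return false without any hand-written checks.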

XPath Queries

The IBM Java XPath article starts with a 30-line DOM program to find particular elements within an XML file. The equivalent XPath is a single line. For our XML format, the code to obtain the title of a javablog:post element from a Document object is below. Note that a more common and useful way to create a Source is to wrap a Document, the core object of the DOM model, in a DOMSource. You get the Document from DocumentBuilder#parse. Naturally, DocumentBuilder is not thread safe, so you need to create a new one each time from the DocumentBuilderFactory we made in the constructor. Sigh.
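For the record, getting from XML text to a Document and then to a DOMSource might look roughly like this. It is a sketch, and DomLoading, parse and asSource are names I have made up for it:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only
public class DomLoading {

    /** Parses XML text into a namespace-aware DOM Document. */
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespaced XPath and XSD
            // DocumentBuilder is not thread safe, so use a fresh one per call
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Wraps the Document so it can be handed to Validator#validate. */
    public static Source asSource(Document doc) {
        return new DOMSource(doc);
    }
}
```

Unlike a StreamSource over a Reader, a DOMSource can be validated and then queried as many times as you like, because the whole tree is already in memory.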

XPath xPath = xPathFactory.newXPath();
xPath.setNamespaceContext(nsContext);
// need to have your input 'doc' as a Node object, e.g. a Document object
String title = xPath.evaluate("/J:post/title/text()", doc);
// attributes are selected with a "/@name" step
String dateString = xPath.evaluate("/J:post/@dateTime", doc);
XMLGregorianCalendar dateTime = dtFactory.newXMLGregorianCalendar(dateString);
// URI.create avoids the checked URISyntaxException that new URI(String) declares
URI url = URI.create(xPath.evaluate("/J:post/@url", doc));

Unfortunately, XPath objects are not thread safe (completely stupid design decision!), making compilation of them next to useless. To throw more spanners in the works… any namespace prefix in an XPath query (here we use “J:” instead of “javablog:”) has to be resolved by a NamespaceContext, and you need to implement your own. My own implementation is close enough to the simple O’Reilly implementation to make it pointless to reproduce here. For our example, you’d need to add the mapping “J” -> “javablog”; the only alternative is to write every query with local-name() and namespace-uri() tests. Again, the lack of a standard implementation was a very, very stupid idea on the part of the Sun developers. Also, if you want to get a Date object out of that dateTime, you’ll need to first convert it to a GregorianCalendar with toGregorianCalendar(), then call the getTime() method. Really ugly, but at least it’ll never fail!
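In case it helps, here is a rough sketch of a single-prefix NamespaceContext together with the queries and the Date conversion. This is not my production implementation; PostReader, JavablogContext and the helper methods are names invented for the example. Note that attribute steps are written with a slash, as in /J:post/@dateTime:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only
public class PostReader {

    /** Maps the single prefix "J" to the namespace URI "javablog". */
    static final class JavablogContext implements NamespaceContext {
        public String getNamespaceURI(String prefix) {
            if (prefix == null) throw new IllegalArgumentException("null prefix");
            return "J".equals(prefix) ? "javablog" : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return "javablog".equals(uri) ? "J" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyList().iterator()
                : Collections.singletonList(prefix).iterator();
        }
    }

    /** Parses XML text into a namespace-aware Document (assumed already validated). */
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // XPath objects are not thread safe, so a new one is created per query
    private static String eval(String expression, Document doc) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(new JavablogContext());
            return xPath.evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static String title(Document doc) {
        return eval("/J:post/title/text()", doc);
    }

    /** Converts the dateTime attribute to java.util.Date via GregorianCalendar. */
    public static Date postedAt(Document doc) {
        try {
            XMLGregorianCalendar cal = DatatypeFactory.newInstance()
                .newXMLGregorianCalendar(eval("/J:post/@dateTime", doc));
            return cal.toGregorianCalendar().getTime();
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The prefix in the query does not have to match the prefix in the document. Only the namespace URI they map to matters, which is why “J” finds elements written as “javablog:post”.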

So the point is:-

  • W3C Schema Definitions (XSDs) let you separate validation from logic
  • XPath queries are a serious timesaver and make code very readable
  • the Java XML API is possibly the ugliest and stupidest API you will ever see in your life

So, for my next blog post I’m going to remove the third point and leave you with all the Schema and XPath goodness.


This entry was posted by Sam on Saturday, November 15th, 2008 at 1:17 am, and is filed under API, Advanced, Validation, XML, XPath.

