Javablog
by Java coders, for Java coders

XML Parsing

March 29th, 2007 by Sam

This is a fairly light topic to begin the blog with, but everyone has to parse XML one day and there are many ways to do it in Java. Most readers are probably already aware that Java now ships with DOM and SAX, but sometimes these techniques are just too heavyweight.

I’ll start by briefly going through SAX and DOM for those who are not familiar with them, and then present a lightweight alternative for each.

DOM

The DOM (Document Object Model) is an interface that exposes an XML document as a tree structure composed of nodes. The DOM allows you to navigate the tree and edit any of its elements.

DOM is not really appropriate for general parsing of XML files; it is more of a template for complicated, open-format file types. The API is quite intimidating, but the Sun DOM tutorial (although a little preoccupied by JFrame representations) does give a decent tour of what is possible.

The ability to navigate the XML document as a tree is a great advantage when the structure is complex, but it does involve storing all elements in memory, making it completely unfit for purpose on large XML files.
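
For readers who have not seen the DOM in action, here is a minimal sketch using the parser bundled with the JDK. The class name DomSketch, the titles helper and the sample document are made up for illustration; the point is only that the whole tree is built first and can then be queried in any order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library.
class DomSketch {

    /** Loads the whole document into memory and returns the text of every <title>. */
    static List<String> titles(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Once parsing has finished, the tree can be searched and walked freely.
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element title = (Element) nodes.item(i);
            result.add(title.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><entry><title>First</title><text>Hello</text></entry>"
                + "<entry><title>Second</title><text>World</text></entry></feed>";
        System.out.println(titles(xml));
    }
}
```

Because builder.parse returns only after the entire document has been read, memory use grows with the size of the file, which is the drawback described above.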

SAX

SAX is a very scalable event-driven parser and is without a doubt the best solution if you want performance and the ability to parse large XML files. The Sun SAX tutorial completely overcomplicates the API. To parse an XML file, wrap its stream in an InputSource, obtain an XMLReader instance by calling a factory method, and extend the DefaultHandler adapter class, overriding the relevant methods to deal with this particular flavour of XML. You set the parser running and the handler then receives a callback each time something of interest occurs, such as the start of an element. The callbacks are made from inside the parse call, so the handler only has to react to events rather than drive the parsing.

    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.setContentHandler(new MyHandler());
    parser.parse(new InputSource(inputStream));

As an example parsing exercise, suppose we have lots of <entry> elements, each with a <title> and a <text>. We would then be interested in overriding only three of the methods in the handler: startElement, characters and endElement. We keep track of where we are in the tree by manually switching the boolean fields title and text when we enter and exit those elements.

    @Override
    public void startElement(final String nsURI, final String localName,
            final String rawName, final Attributes attributes)
            throws SAXException {
        if ("title".equalsIgnoreCase(rawName)) {
            title = true;
        } else if ("text".equalsIgnoreCase(rawName)) {
            text = true;
        }
    }
    
    @Override
    public void endElement(final String nsURI, final String localName,
            final String rawName) throws SAXException {
        if ("title".equalsIgnoreCase(rawName)) {
            title = false;
        } else if ("text".equalsIgnoreCase(rawName)) {
            text = false;
        }
    }
    
    @Override
    public void characters(final char[] ch, final int start,
            final int length) {
        if (title) {
            // add to the title StringBuilder here
        } else if (text) {
            // add to the text StringBuilder here
        }
    }

Note that characters can be called an arbitrary number of times with pieces of the element contents. It is probably best to initialise a StringBuilder at the start of an element, append each piece in every call of characters, and then pass the information on when you hit the end of the element. I typically extend DefaultHandler with the class I wish to do all the XML parsing and set up methods such as gotText(String text) that get called at the end of the element.

I like the SAX interface, but it does involve writing a lot of boilerplate code for every new file you want to parse, and handling deep tree structures or namespaces can be a nightmare.
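
The StringBuilder approach can be sketched roughly as follows. This is an illustration under a few assumptions rather than a definitive implementation: EntryHandler and gotTitle are made-up names (only gotText is mentioned above), a single buffer is shared between the two elements, and the parser is obtained from SAXParserFactory because newer JDKs deprecate XMLReaderFactory. Element names are compared with equals, since XML names are case-sensitive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for <entry> elements containing <title> and <text>.
class EntryHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    final List<String> texts = new ArrayList<>();

    // Non-null only while we are inside a <title> or <text> element.
    private StringBuilder buffer;

    @Override
    public void startElement(String nsURI, String localName,
            String rawName, Attributes attributes) {
        if ("title".equals(rawName) || "text".equals(rawName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String nsURI, String localName, String rawName) {
        if ("title".equals(rawName)) {
            gotTitle(buffer.toString());
            buffer = null;
        } else if ("text".equals(rawName)) {
            gotText(buffer.toString());
            buffer = null;
        }
    }

    // Called once per element, with the complete contents.
    void gotTitle(String title) { titles.add(title); }
    void gotText(String text) { texts.add(text); }

    /** Convenience method: parses the given XML and returns the populated handler. */
    static EntryHandler parse(String xml) throws Exception {
        EntryHandler handler = new EntryHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        EntryHandler h = parse("<feed><entry><title>Hello</title>"
                + "<text>a &amp; b</text></entry></feed>");
        System.out.println(h.titles + " " + h.texts);
    }
}
```

The entity reference in the sample text is there on purpose: a parser is free to deliver such content in more than one characters call, which is exactly the case the buffer is meant to handle.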

XML Pull

The XML Pull API is a fantastic little parser API for synchronously parsing XML files. The kXML2 implementation weighs in at 10k and works in J2ME. I’ve used it in OpenLAPI for parsing Google Earth files and it was dead simple to use.

It doesn’t store XML files in memory, but instead of pushing events at a handler like SAX, it requires you to ask the parser to move to the next significant event, where you can then ask for names and attributes. If you’re not used to callback-driven APIs you may even prefer this over SAX, but be warned that you may not get the same performance.

It allows you to write methods that are called when you enter a particular element and that return Java objects. These methods can then take over the parsing until the element and its contents have been read, returning something meaningful… something you can’t do with SAX. In this example, we parse a Google Earth KML file looking for Placemark elements and delegate the parsing of each one to another method, which returns a Location object whilst moving the parsing cursor along.

    int event = parser.next();
    for (; event != XmlPullParser.END_DOCUMENT; event = parser.next()) {
        if (event != XmlPullParser.START_TAG)
            continue;
        String name = parser.getName();
        if ("Placemark".equals(name)) {
            // if it's a Placemark parse it
            Location location = parsePlacemark();
            doSomething(location);
        }
        // else ignore it
    }
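
The snippet above leaves parsePlacemark to the imagination. To keep the following sketch runnable without extra jars, it swaps in the JDK's built-in StAX pull parser (javax.xml.stream) rather than the XML Pull API itself; the delegation pattern is the same but the method names differ. The Location class and the choice of fields are assumptions made for illustration, and the only KML details relied on are the name element and the "longitude,latitude[,altitude]" layout of coordinates.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example using StAX in place of the XML Pull API.
class PlacemarkSketch {

    /** Hypothetical stand-in for the Location type used in the post. */
    static final class Location {
        final String name;
        final double longitude;
        final double latitude;

        Location(String name, double longitude, double latitude) {
            this.name = name;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    /** Main loop: skip everything except the start of a Placemark. */
    static List<Location> parse(String kml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(kml));
        List<Location> found = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "Placemark".equals(reader.getLocalName())) {
                found.add(parsePlacemark(reader));
            }
        }
        return found;
    }

    /** Entered with the cursor on <Placemark>; returns with it on </Placemark>. */
    static Location parsePlacemark(XMLStreamReader reader) throws XMLStreamException {
        String name = null;
        double longitude = Double.NaN;
        double latitude = Double.NaN;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.END_ELEMENT
                    && "Placemark".equals(reader.getLocalName())) {
                break; // hand control back to the main loop
            }
            if (event != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String tag = reader.getLocalName();
            if ("name".equals(tag)) {
                name = reader.getElementText();
            } else if ("coordinates".equals(tag)) {
                String[] parts = reader.getElementText().trim().split(",");
                longitude = Double.parseDouble(parts[0]);
                latitude = Double.parseDouble(parts[1]);
            }
        }
        return new Location(name, longitude, latitude);
    }
}
```

The helper consumes events itself and only returns once it has seen the closing Placemark tag, so the outer loop never has to know what a Placemark contains.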

NanoXML Lite

NanoXML has a SAX wrapper, but that’s not what I found interesting about it. It has a NanoXML/Lite flavour which is so small (5k) I mistook it for a J2ME library and got a day’s coding done before realising it wasn’t.

The Lite version doesn’t implement DOM or SAX. However, because it loads the entire XML file into memory, it can provide tree navigation. Each element is parsed into an Element object which contains its attributes, data and child elements, making for really easy coding.

If you have small enough XML files and aren’t after performance, I’d certainly recommend NanoXML/Lite. The API is just so damn intuitive.


This entry was posted by Sam on Thursday, March 29th, 2007 at 4:17 pm, and is filed under J2ME, Java, Parsing, XML.



3 comments on “XML Parsing”

Good introductory article.
A word about a small gotcha I’ve encountered regarding Unicode. Although Java Strings use Unicode, the input and output streams typically don’t by default. To get UTF-8 support (and hence avoid lots of ?s appearing in your text), you need to wrap your InputStream with a UTF-8 Reader. E.g. for SAX:

    input = new InputSource(new InputStreamReader(inputStream, "UTF-8"));

This issue crops up in other Java IO tasks, so it’s good to be aware of it.

I wonder if there was a good reason for “UTF-8” not being the default charset, or if this is a legacy issue where someone screwed up. The most annoying thing is that the FileReader class (and its associates) also uses the default charset, meaning that although Java is completely UTF-8 safe, the convenience I/O classes aren’t.

Because the XML processing libraries, including org.xml.sax.InputSource, can accept streams, they implicitly promise they may be able to handle the conversion of the bytes produced by these streams into characters. And they do: if the XML declaration contains the “encoding” attribute, then the processor will honor the attribute and use whatever encoding is specified in the XML. This takes some burden away from the program, which need not make assumptions about the encoding, and places some burden on XML authoring, where you might as well specify an encoding.

What happens when the “encoding” attribute is missing? It’s up to your tool, but I believe that the standard libraries fall back to the platform encoding (ISO-8859-1 on Windows-based operating systems, if I’m not mistaken).
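
One way to check this rather than guess is to hand the parser raw bytes and see what comes back. For what it's worth, the XML 1.0 specification says a document with neither a byte order mark nor an encoding declaration has to be UTF-8, so a conforming parser should assume UTF-8 and not the platform charset. The sketch below uses the JDK's bundled DOM parser; EncodingSketch and rootText are made-up names, and the sample documents are only illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class for probing how the parser decodes bytes.
class EncodingSketch {

    /** Parses raw bytes, leaving charset detection entirely to the XML parser. */
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Declared encoding: the bytes are Latin-1 and the declaration says so.
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // No declaration at all: the bytes are UTF-8.
        byte[] undeclared = "<t>caf\u00e9</t>".getBytes(StandardCharsets.UTF_8);

        System.out.println("declared decoded correctly:   "
                + "caf\u00e9".equals(rootText(declared)));
        System.out.println("undeclared decoded correctly: "
                + "caf\u00e9".equals(rootText(undeclared)));
    }
}
```

Note that wrapping the stream in a Reader, as suggested in the first comment, takes the decision away from the parser, so it is only safe when you already know the encoding of the bytes.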
