This is a fairly light topic to begin the blog with, but everyone has to parse XML one day and there are many ways to do it in Java. Most readers are probably already aware that Java now ships with DOM and SAX, but sometimes these techniques are just too heavyweight.
I’ll start by briefly going through DOM and SAX for those who are not familiar with them, and then present a lightweight alternative to each.
DOM
The DOM (Document Object Model) is an interface that exposes an XML document as a tree structure comprised of nodes. The DOM allows you to navigate the tree and edit any of its elements.
DOM is not really appropriate for general parsing of XML files; it is more of a template for complicated, open-format file types. The API is quite intimidating, but the Sun DOM tutorial (although a little preoccupied with JFrame representations) does give a decent tour of what is possible.
The ability to navigate the XML document as a tree is a great advantage when the structure is complex, but it means holding every element in memory, which makes it unfit for large XML files.
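To make the tree idea concrete, here is a minimal sketch using the DOM classes bundled with the JDK. The class name DomTitles and the <feed>/<entry>/<title> layout are my own assumptions for illustration, not anything from the Sun tutorial.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class DomTitles {

    /** Returns the text of every <title> element, in document order. */
    static List<String> titles(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is read into memory as a tree here.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> result = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("title");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element title = (Element) nodes.item(i);
                result.add(title.getTextContent().trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<feed>"
                + "<entry><title>First</title><text>Hello</text></entry>"
                + "<entry><title>Second</title><text>World</text></entry>"
                + "</feed>";
        System.out.println(titles(xml)); // prints [First, Second]
    }
}
```

Once the Document exists you can walk it in any direction (getParentNode, getChildNodes and so on), which is exactly what costs the memory.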
SAX
SAX is a very scalable event-driven (push) parser, and it is usually the best solution if you want performance and the ability to parse large XML files. The Sun SAX tutorial overcomplicates the API. To parse an XML file, wrap its stream in an InputSource, obtain an XMLReader instance by calling a factory method, and extend the no-op DefaultHandler class, overriding the relevant methods to deal with this particular flavour of XML. You set the parser running and the handler then receives a callback each time the parser encounters something of interest. It feels asynchronous, but the callbacks run on the thread that called parse.
XMLReader parser = XMLReaderFactory.createXMLReader();
parser.setContentHandler(new MyHandler());
parser.parse(new InputSource(inputStream));
As an example parsing exercise, suppose we have lots of <entry> elements, each with a <title> and a <text>. We would then be interested in overriding only three of the handler's methods: startElement, characters and endElement. We keep track of where we are in the tree by manually switching the boolean fields title and text as we enter and exit those elements.
@Override
public void startElement(final String nsURI, final String localName,
        final String rawName, final Attributes attributes)
        throws SAXException {
    if ("title".equalsIgnoreCase(rawName)) {
        title = true;
    } else if ("text".equalsIgnoreCase(rawName)) {
        text = true;
    }
}

@Override
public void endElement(final String nsURI, final String localName,
        final String rawName) throws SAXException {
    if ("title".equalsIgnoreCase(rawName)) {
        title = false;
    } else if ("text".equalsIgnoreCase(rawName)) {
        text = false;
    }
}

@Override
public void characters(final char[] ch, final int start,
        final int length) {
    if (title) {
        // add to the title StringBuilder here
    } else if (text) {
        // add to the text StringBuilder here
    }
}
Note that characters can be called any number of times, each time with a piece of the element's content. It is probably best to initialise a StringBuilder at the start of an element, append each piece as characters is called, and pass the result on when you hit the end of the element. I typically have the class that will do all the XML parsing extend DefaultHandler, and set up methods such as gotText(String text) that get called at the end of the element.
I like the SAX interface, but it does involve writing a lot of boilerplate code for every new file format you want to parse, and handling deep tree structures or namespaces can be a nightmare.
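Putting the StringBuilder advice together, a complete handler might look something like the sketch below. The class name EntryHandler, the gotTitle method and the entries list are hypothetical names of mine, and I obtain the parser from SAXParserFactory rather than XMLReaderFactory, which is deprecated in recent JDKs. Element names are matched exactly here, since XML names are case-sensitive.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for <entry><title>..</title><text>..</text></entry>.
final class EntryHandler extends DefaultHandler {

    /** Collected "title: text" strings, in document order. */
    final List<String> entries = new ArrayList<>();

    private StringBuilder buffer; // non-null only inside <title> or <text>
    private String title = "";

    @Override
    public void startElement(String nsURI, String localName,
            String rawName, Attributes attributes) {
        if ("title".equals(rawName) || "text".equals(rawName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so always append.
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String nsURI, String localName, String rawName) {
        if (buffer == null) {
            return;
        }
        if ("title".equals(rawName)) {
            gotTitle(buffer.toString());
            buffer = null;
        } else if ("text".equals(rawName)) {
            gotText(buffer.toString());
            buffer = null;
        }
    }

    private void gotTitle(String value) {
        title = value;
    }

    private void gotText(String value) {
        entries.add(title + ": " + value);
    }

    /** Convenience wrapper: parses a string and returns the collected entries. */
    static List<String> parse(String xml) {
        try {
            EntryHandler handler = new EntryHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling EntryHandler.parse(xml) on a feed of entries gives back one string per entry. In real code gotText would more likely hand the values to whatever object you are building.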
XML Pull
The XML Pull API is a fantastic little parser API for synchronously parsing XML files. The kXML2 implementation weighs in at 10k and works in J2ME. I’ve used it in OpenLAPI for parsing Google Earth files and it was dead simple to use.
It doesn’t store XML files in memory, but instead of pushing events at a handler like SAX, it requires you to ask the parser to move to the next significant event, after which you can ask for names, attributes and text. If you’re not used to callback-driven APIs you may even prefer this over SAX, but be warned that you may not get the same performance.
It allows you to write methods that are called when you enter a particular element and that return Java objects. Such a method takes over the parsing until the element and its contents have been read and then returns something meaningful, which you can’t do directly with SAX callbacks. In this example, we parse a Google Earth KML file looking for Placemark elements, delegating the parsing of each one to another method that returns a Location object whilst moving the parsing cursor along.
int event = parser.next();
for (; event != XmlPullParser.END_DOCUMENT; event = parser.next()) {
    if (event != XmlPullParser.START_TAG) {
        continue;
    }
    String name = parser.getName();
    if ("Placemark".equals(name)) {
        // if it's a Placemark parse it
        Location location = parsePlacemark();
        doSomething(location);
    }
    // else ignore it
}
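The parsePlacemark method is not shown above, so here is a rough sketch of the delegation pattern. To keep it runnable without extra jars I have swapped in the JDK's own pull parser, StAX (XMLStreamReader), instead of XmlPull; the two APIs differ in naming but the cursor idea is the same. The Location fields, the PlacemarkPull class and the assumption that each Placemark holds a name and a Point with coordinates are all mine.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; uses StAX, not the XmlPull API.
final class PlacemarkPull {

    /** Hypothetical value type; the real Location class is not shown in the post. */
    static final class Location {
        final String name;
        final double longitude;
        final double latitude;

        Location(String name, double longitude, double latitude) {
            this.name = name;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    /** Main loop: skip everything until a Placemark starts, then delegate. */
    static List<Location> parse(String kml) {
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(kml));
            List<Location> result = new ArrayList<>();
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT
                        && "Placemark".equals(parser.getLocalName())) {
                    result.add(parsePlacemark(parser));
                }
            }
            return result;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse KML", e);
        }
    }

    /** Entered with the cursor on <Placemark>; returns with it on </Placemark>. */
    private static Location parsePlacemark(XMLStreamReader parser)
            throws XMLStreamException {
        String name = "";
        double longitude = 0;
        double latitude = 0;
        while (parser.hasNext()) {
            int event = parser.next();
            if (event == XMLStreamConstants.END_ELEMENT
                    && "Placemark".equals(parser.getLocalName())) {
                break;
            }
            if (event != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String tag = parser.getLocalName();
            if ("name".equals(tag)) {
                name = parser.getElementText().trim();
            } else if ("coordinates".equals(tag)) {
                // KML coordinates are written as longitude,latitude[,altitude]
                String[] parts = parser.getElementText().trim().split(",");
                longitude = Double.parseDouble(parts[0]);
                latitude = Double.parseDouble(parts[1]);
            }
        }
        return new Location(name, longitude, latitude);
    }
}
```

The point to notice is that parsePlacemark owns the cursor for the duration of the element and simply returns a finished object, so the outer loop never needs to track any state.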
NanoXML Lite
NanoXML has a SAX wrapper, but that’s not what I found interesting about it. It has a NanoXML/Lite flavour which is so small (5k) I mistook it for a J2ME library and got a day’s coding done before realising it wasn’t.
The Lite version doesn’t implement DOM or SAX. However, because it loads the entire XML file into memory, it can provide tree navigation. Each element is parsed into an Element object which contains its attributes, data and child elements, making for really easy coding.
If you have small enough XML files and aren’t after performance, I’d certainly recommend NanoXML/Lite. The API is just so damn intuitive.