This post details my experiments parsing the same document with the usual suspects - DOM, SAX, and Pull parsing - and comparing the results for readability and performance, especially on Android. The parsing mechanisms compared here are:

  1. W3C DOM parsing
  2. W3C DOM and XPath
  3. SAX Parsing
  4. Pull Parsing
  5. dsl4xml (dsl around Pull-parser)
  6. SJXP (thin Pull-parser wrapper using xpath-like expressions)

I hope to add more later - some contenders include JAXB, XStream, and Simple.

The code for the entire project is on GitHub. You will need to Maven-install the dsl4xml library if you want to run the tests yourself, as I'm afraid I don't have a public repo for it yet.

Important Note: This experiment was inspired by some work I did to optimise a slow Android app, where the original authors had used mostly DOM parsing with a sprinkling of XPath.

My ultimate aim was to run these perf tests on one or more real Android devices and show how they compare there.

For this reason, if you look at the project on GitHub, you'll see that I've imported the Android 4 jar and used only the parser implementations that are available in Android without additional imports. (OK, the two pull-parser wrappers require very small standalone jars, sorry.)

The Android project and Activity for running the tests on a device is in a separate project here.

The XML

The XML file being parsed is a Twitter search result (Atom feed). You can see the actual file here, but this is a snippet of the parts I'm interested in parsing for these tests (the 15 <entry> elements in the document):

<?xml version="1.0" encoding="UTF-8"?>
<feed .. >
  ..
  <entry>
    ..
    <published>2012-04-09T10:10:24Z</published>
    <title>Tweet title</title>
    <content type="html">Full tweet content</content>
    ..
    <twitter:lang>en</twitter:lang>
    <author>
        <name>steveliles (Steve Liles)</name>
        <uri>http://twitter.com/steveliles</uri>
    </author>
  </entry>
  ..
</feed>

The POJOs

The Java objects we're unmarshalling to are very simple and don't need any explanation. You can see them on GitHub here.

Parsing the Twitter/Atom feed

First, just a few notes on what I'm trying to do. I basically want to compare two things:

  1. Readability/maintainability of typical parsing code.
  2. Parsing performance with said typical parsing code, incl. under concurrent load.

With that in mind, I've tried to keep the parsing code small, tight, and (AFAIK) typical for each mechanism, but without layering any further libraries or helper methods on top.

In working with each parsing mechanism I have tried to choose more performant approaches where the readability trade-off is not high.

Without further ado, let's see what parsing this document and marshalling to Java objects is like using the various libraries.

W3C DOM

DOM (Document Object Model) parsing builds an in-memory object representation of the entire XML document. You can then rummage around in the DOM, going back and forth between elements and reading data from them in whatever order you like.

Because the entire document is read into memory, there is an upper limit on the size of document you can read (constrained by the size of your Java heap).

Memory is not used particularly efficiently either - a DOM may consist of very many sparsely populated List objects (backed by mostly empty arrays). A side effect of all these objects in memory is that when you're finished with them there's a lot for the Garbage Collector to clean up.

On the plus side, DOM parsing is straightforward to work with, particularly if you don't care much about speed and use getElementsByTagName() wherever possible.
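As a quick illustration of that convenient-but-slow style, here's a hypothetical, self-contained sketch (not from the test project) that pulls out every <title> with getElementsByTagName() - each call re-scans the whole tree:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TitlesDemo {
    // Collect the text of every <title> element, wherever it appears,
    // using getElementsByTagName - no manual tree-walking required.
    public static List<String> titles(String xml) throws Exception {
        DocumentBuilder b =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document d = b.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList nl = d.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nl.getLength(); i++) {
            result.add(nl.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><entry><title>a</title></entry>"
                   + "<entry><title>b</title></entry></feed>";
        System.out.println(titles(xml)); // [a, b]
    }
}
```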

The actual code I used for the performance test is here, but this is roughly what it ended up looking like:

private DocumentBuilder builder;
private DateFormat dateFormat;

public DOMTweetsReader() 
throws Exception {
    DocumentBuilderFactory factory = 
        DocumentBuilderFactory.newInstance();
    builder = factory.newDocumentBuilder();
    dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
}

@Override
public String getParserName() {
    return "W3C DOM";
}

public Tweets read(InputStream anInputStream) 
throws Exception {
    Document _d = builder.parse(anInputStream);
    Tweets _result = new Tweets();
    unmarshall(_d, _result);
    return _result;
}

public void unmarshall(Document aDoc, Tweets aTo) 
throws Exception {
    NodeList _nodes = aDoc.getChildNodes().item(0).getChildNodes();
    for (int i=0; i<_nodes.getLength(); i++) {
        Node _n = _nodes.item(i);
        if ((_n.getNodeType() == Node.ELEMENT_NODE) && 
            ("entry".equals(_n.getNodeName()))
         ){
            Tweet _tweet = new Tweet();
            aTo.addTweet(_tweet);
            unmarshallEntry((Element)_n, _tweet);
        }
    }
}

private void unmarshallEntry(Element aTweetEl, Tweet aTo)
throws Exception {
    NodeList _nodes = aTweetEl.getChildNodes();
    for (int i=0; i<_nodes.getLength(); i++) {
        Node _n = _nodes.item(i);
        if (_n.getNodeType() == Node.ELEMENT_NODE) {                    
            if ("published".equals(_n.getNodeName())) {                         
                aTo.setPublished(dateFormat.parse(getPCData(_n)));
            } else if ("title".equals(_n.getNodeName())) {
                aTo.setTitle(getPCData(_n));
            } else if ("content".equals(_n.getNodeName())) {
                Content _content = new Content();
                aTo.setContent(_content);
                unmarshallContent((Element)_n, _content);
            } else if ("lang".equals(_n.getNodeName())) {
                aTo.setLanguage(getPCData(_n));
            } else if ("author".equals(_n.getNodeName())) {
                Author _author = new Author();
                aTo.setAuthor(_author);
                unmarshallAuthor((Element)_n, _author);
            }
        }
    }
}

private void unmarshallContent(Element aContentEl, Content aTo) {
    aTo.setType(aContentEl.getAttribute("type"));
    aTo.setValue(getPCData(aContentEl));
}

private void unmarshallAuthor(Element anAuthorEl, Author aTo) {
    NodeList _nodes = anAuthorEl.getChildNodes();
    for (int i=0; i<_nodes.getLength(); i++) {
        Node _n = _nodes.item(i);
        if ("name".equals(_n.getNodeName())) {
            aTo.setName(getPCData(_n));
        } else if ("uri".equals(_n.getNodeName())) {
            aTo.setUri(getPCData(_n));
        }
    }
}

private String getPCData(Node aNode) {
    StringBuilder _sb = new StringBuilder();
    if (Node.ELEMENT_NODE == aNode.getNodeType()) {
        NodeList _nodes = aNode.getChildNodes();
        for (int i=0; i<_nodes.getLength(); i++) {
            Node _n = _nodes.item(i);
            if (Node.ELEMENT_NODE == _n.getNodeType()) {
                _sb.append(getPCData(_n));
            } else if (Node.TEXT_NODE == _n.getNodeType()) {
                _sb.append(_n.getNodeValue());
            }
        }
    }
    return _sb.toString();
}

It's worth noting that I would normally extract some useful utility classes/methods - getPCData(Node), for example - but here I'm trying to keep the sample self-contained.

Note that this code is not thread-safe because of the unsynchronized use of SimpleDateFormat. I am using separate instances of the Reader classes in each thread for my threaded tests.
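(A common alternative to per-thread Reader instances is a ThreadLocal around the date format. This is a minimal sketch of that idea, not something the test project does - the class name is hypothetical:)

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public class SafeDates {
    // SimpleDateFormat is not thread-safe; a ThreadLocal gives each
    // thread its own private instance, created lazily on first use.
    private static final ThreadLocal<DateFormat> FORMAT =
        ThreadLocal.withInitial(
            () -> new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"));

    public static Date parse(String aDate) throws Exception {
        return FORMAT.get().parse(aDate);
    }

    public static void main(String[] args) throws Exception {
        // DateFormat.parse ignores the trailing "Z" the pattern doesn't cover.
        System.out.println(parse("2012-04-09T10:10:24Z"));
    }
}
```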

W3C DOM and XPath

XPath is a language for describing locations within an XML document as paths from a starting location - the root of the document (/), the current location (.//), or anywhere (//).

I've used XPath on and off for years, mostly in XSLT stylesheets, but also occasionally to pluck bits of information out of documents in code. It is very straightforward to use.
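To show what I mean by starting locations, here's a tiny self-contained example (not the test code - no namespaces involved) evaluating two expressions against the same document:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathStarts {
    // Parse the xml string and evaluate the given XPath expression,
    // returning the string value of the first matching node.
    public static String eval(String expr, String xml) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate(expr, d);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><entry><title>hi</title></entry></feed>";
        System.out.println(eval("/feed/entry/title", xml)); // from the root: hi
        System.out.println(eval("//title", xml));           // from anywhere: hi
    }
}
```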

Here's a sample for parsing our Twitter Atom feed. The actual test code is in github.

private DocumentBuilder builder;
private XPathFactory factory;

private XPathExpression entry;
private XPathExpression published;
private XPathExpression title;
private XPathExpression contentType;
private XPathExpression content;
private XPathExpression lang;
private XPathExpression authorName;
private XPathExpression authorUri;

private DateFormat dateFormat;

public DOMXPathTweetsReader() 
throws Exception {
    DocumentBuilderFactory _dbf = 
        DocumentBuilderFactory.newInstance();
    _dbf.setNamespaceAware(true);
    builder = _dbf.newDocumentBuilder();
    factory = XPathFactory.newInstance();

    NamespaceContext _ctx = new NamespaceContext() {
        public String getNamespaceURI(String aPrefix) {
            String _uri;
            if (aPrefix.equals("atom"))
                _uri = "http://www.w3.org/2005/Atom";
            else if (aPrefix.equals("twitter"))
                _uri = "http://api.twitter.com/";
            else
                _uri = null;
            return _uri;
        }

        @Override
        public String getPrefix(String aArg0) {
            return null;
        }

        @Override
        @SuppressWarnings("rawtypes")
        public Iterator getPrefixes(String aArg0) {
            return null;
        }
    };

    entry = newXPath(factory, _ctx, "/atom:feed/atom:entry");
    published = newXPath(factory, _ctx, ".//atom:published");
    title = newXPath(factory, _ctx, ".//atom:title");
    contentType = newXPath(factory, _ctx, ".//atom:content/@type");
    content = newXPath(factory, _ctx, ".//atom:content");
    lang = newXPath(factory, _ctx, ".//twitter:lang");
    authorName = newXPath(factory, _ctx, ".//atom:author/atom:name");
    authorUri = newXPath(factory, _ctx, ".//atom:author/atom:uri");

    dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
}

private XPathExpression newXPath(
    XPathFactory aFactory, NamespaceContext aCtx, String anXPath
) throws Exception {
    XPath _xp = aFactory.newXPath();
    _xp.setNamespaceContext(aCtx);
    return _xp.compile(anXPath);
}

@Override
public String getParserName() {
    return "W3C DOM/XPath";
}

@Override
public Tweets read(InputStream anInputStream)
throws Exception {
    Tweets _result = new Tweets();
    Document _document = builder.parse(anInputStream);

    NodeList _entries = (NodeList) 
        entry.evaluate(_document, XPathConstants.NODESET);                  
    for (int i=0; i<_entries.getLength(); i++) {
        Tweet _tweet = new Tweet();
        _result.addTweet(_tweet);

        Node _entryNode = _entries.item(i);

        _tweet.setPublished(getPublishedDate(_entryNode));
        _tweet.setTitle(title.evaluate(_entryNode));
        _tweet.setLanguage(lang.evaluate(_entryNode));

        Content _c = new Content();
        _tweet.setContent(_c);

        _c.setType(contentType.evaluate(_entryNode));
        _c.setValue(content.evaluate(_entryNode));

        Author _a = new Author();
        _tweet.setAuthor(_a);

        _a.setName(authorName.evaluate(_entryNode));
        _a.setUri(authorUri.evaluate(_entryNode));
    }

    return _result;
}

private Date getPublishedDate(Node aNode) 
throws Exception {
    return dateFormat.parse(published.evaluate(aNode));
}

The code ends up quite easy to read, and can be written to nest in a way that mimics the document structure. There is one very big downside - as you'll see later, the performance is atrocious.

SAX Parser

SAX stands for Simple API for XML. It uses a "push" approach: whereas with DOM you can dig around in the document in whatever order you like, SAX parsing is event-driven which means you have to handle the data as it is given to you.

SAX parsers fire events when they encounter the various components that make up an XML file. You register a ContentHandler whose methods are called back when these events occur (for example, when the parser finds a new start element, it invokes the startElement method of your ContentHandler).

The API assumes that the consumer (ContentHandler) is going to maintain some awareness of its state (e.g. where it currently is within the document). I sometimes use a java.util.Stack to push/pop/peek at which element I'm currently working in, but here I can get away with just recording the name of the current element.
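(For deeper documents, the Stack approach I mentioned looks roughly like this - a hypothetical helper, not part of the test code: push in startElement, pop in endElement, peek to find out where you are.)

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal element-path tracker for a SAX ContentHandler.
public class ElementPath {
    private final Deque<String> stack = new ArrayDeque<>();

    public void startElement(String name) { stack.push(name); }
    public void endElement()              { stack.pop(); }
    public String current()               { return stack.peek(); }

    public static void main(String[] args) {
        ElementPath p = new ElementPath();
        p.startElement("feed");
        p.startElement("entry");
        p.startElement("title");
        System.out.println(p.current()); // title
        p.endElement();
        System.out.println(p.current()); // entry
    }
}
```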

I'm extending DefaultHandler because I'm not interested in many of the events (it provides a default empty implementation of those methods for me).

The actual test code is on GitHub, and is somewhat more complex in order to handle entity-refs via a LexicalHandler, but here's the gist of it:

private XMLReader reader;
private TweetsHandler handler;

public SAXTweetsReader() 
throws Exception {
    SAXParserFactory _f = SAXParserFactory.newInstance();
    SAXParser _p = _f.newSAXParser();
    reader = _p.getXMLReader();
    handler = new TweetsHandler();
    reader.setContentHandler(handler);
}

@Override
public String getParserName() {
    return "SAX";
}

@Override
public Tweets read(InputStream anInputStream) 
throws Exception {
    reader.parse(new InputSource(anInputStream));
    return handler.getResult();
}

private static class TweetsHandler extends DefaultHandler {

    private DateFormat dateFormat = 
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    private Tweets tweets;
    private Tweet tweet;
    private Content content;
    private Author author;
    private String currentElement;

    public Tweets getResult() {
        return tweets;
    }

    @Override
    public void startDocument() throws SAXException {
        tweets = new Tweets();
    }

    @Override
    public void startElement(
        String aUri, String aLocalName, 
        String aQName, Attributes aAttributes
    ) throws SAXException {
        currentElement = aQName;
        if ("entry".equals(aQName)) {
            tweets.addTweet(tweet = new Tweet());
        } else if ("content".equals(aQName)) {
            tweet.setContent(content = new Content());
            content.setType(aAttributes.getValue("type"));
        } else if ("author".equals(aQName)) {
            tweet.setAuthor(author = new Author());
        }
    }

    @Override
    public void endElement(
        String aUri, String aLocalName, String aQName
    ) throws SAXException {
        currentElement = null;
    }

    @Override
    public void characters(char[] aCh, int aStart, int aLength)
    throws SAXException {
        if ("published".equals(currentElement)) {
            try {
                tweet.setPublished(dateFormat.parse(
                    new String(aCh, aStart, aLength))
                );
            } catch (ParseException anExc) {
                throw new SAXException(anExc);
            }
        } else if (
            ("title".equals(currentElement)) &&
            (tweet != null)
        ) {
            tweet.setTitle(new String(aCh, aStart, aLength));
        } else if ("content".equals(currentElement)) {
            content.setValue(new String(aCh, aStart, aLength));
        } else if ("lang".equals(currentElement)) {
            tweet.setLanguage(new String(aCh, aStart, aLength));
        } else if ("name".equals(currentElement)) {
            author.setName(new String(aCh, aStart, aLength));
        } else if ("uri".equals(currentElement)) {
            author.setUri(new String(aCh, aStart, aLength));
        }
    }
}

One downside when handling more complicated documents is that the ContentHandler can get littered with intermediate state objects - for example here I have the tweet, content, and author fields.

Another is that SAX is very low-level and you have to handle pretty much everything yourself - including the fact that text nodes are passed to you in pieces when entity-references are present.
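The usual fix for the split-text problem is to accumulate in a StringBuilder during characters() and only use the text when the element closes. A minimal hypothetical handler (not the one used in the tests) sketching the pattern:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        text.setLength(0); // reset the buffer for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several pieces
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) {
            title = text.toString(); // the complete, re-joined text
        }
    }

    public String getTitle() { return title; }

    public static void main(String[] args) throws Exception {
        SAXParser p = SAXParserFactory.newInstance().newSAXParser();
        BufferingHandler h = new BufferingHandler();
        p.parse(new ByteArrayInputStream(
            "<feed><title>a &amp; b</title></feed>".getBytes("UTF-8")), h);
        System.out.println(h.getTitle()); // a & b
    }
}
```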

Pull Parser

Pull-parsing is the "pull" to SAX parsing's "push". SAX pushes content at you by firing events as it encounters constructs within the xml document. Pull-parsing lets you ask for (pull) the next significant construct you are interested in.

You still have to take the data in the order it appears in the document - you can't go back and forth through the document like you can with DOM - but you can skip over bits you aren't interested in.

The test code is on GitHub; this is roughly what it looks like:

private DateFormat dateFormat;
private XmlPullParserFactory f;
private Tweets tweets;
private Tweet currentTweet;
private Author currentAuthor;

public PullParserTweetsReader() 
throws Exception {
    dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    f = XmlPullParserFactory.newInstance();
    f.setNamespaceAware(true);
}

@Override
public String getParserName() {
    return "Pull-Parser";
}

@Override
public Tweets read(InputStream anInputStream) throws Exception {
    XmlPullParser _p = f.newPullParser();
    _p.setInput(anInputStream, "utf-8");
    return parse(_p);
}

private Tweets parse(XmlPullParser aParser) 
throws Exception {
    tweets = new Tweets();

    int _e = aParser.next();
    while (_e != XmlPullParser.END_DOCUMENT) {
        if (_e == XmlPullParser.START_TAG) {
            startTag(aParser.getPrefix(), aParser.getName(), aParser);
        }
        _e = aParser.next();
    }

    return tweets;
}

private void startTag(String aPrefix, String aName, XmlPullParser aParser)
throws Exception {
    if ("entry".equals(aName)) {
        tweets.addTweet(currentTweet = new Tweet());
    } else if ("published".equals(aName)) {
        aParser.next();
        currentTweet.setPublished(dateFormat.parse(aParser.getText()));
    } else if (("title".equals(aName)) && (currentTweet != null)) {
        aParser.next();
        currentTweet.setTitle(aParser.getText());
    } else if ("content".equals(aName)) {
        Content _c = new Content();
        _c.setType(aParser.getAttributeValue(null, "type"));
        aParser.next();
        _c.setValue(aParser.getText());
        currentTweet.setContent(_c);
    } else if ("lang".equals(aName)) {
        aParser.next();
        currentTweet.setLanguage(aParser.getText());
    } else if ("author".equals(aName)) {
        currentTweet.setAuthor(currentAuthor = new Author());
    } else if ("name".equals(aName)) {
        aParser.next();
        currentAuthor.setName(aParser.getText());
    } else if ("uri".equals(aName)) {
        aParser.next();
        currentAuthor.setUri(aParser.getText());
    }
}

SJXP (Pull-Parser wrapper)

The first of the pull-parser wrappers under test - I stumbled upon this one yesterday. I liked the idea behind it, so decided to give it a try.

I'm a big fan of callbacks generally, and having spent quite some time working with XPath in the past the idea of using XPath-like syntax to request callbacks from the pull-parser seems tempting.

There was one problem I couldn't work around, which seems like either a gap in my knowledge (and the documentation) or an irritating bug - when declaring paths you have to use the full namespace URI, even for elements in the default namespace.

This means that my path declarations, even on this shallow document, are enormous, and I had to split them onto three lines to fit the width of my blog.
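(One way to tame the verbosity would be to extract the bracketed namespace URIs into constants and concatenate them. This is my own workaround sketch, not an SJXP feature - the class and constant names are hypothetical:)

```java
// Hypothetical constants to shorten SJXP-style path declarations.
public final class Ns {
    public static final String ATOM = "[http://www.w3.org/2005/Atom]";
    public static final String TWITTER = "[http://api.twitter.com/]";

    // "/[...]feed/[...]entry/[...]published" becomes:
    public static final String PUBLISHED =
        "/" + ATOM + "feed" + "/" + ATOM + "entry" + "/" + ATOM + "published";

    public static void main(String[] args) {
        System.out.println(PUBLISHED);
    }
}
```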

The code is on GitHub; this is the gist of it:

private Tweet currentTweet;
private DateFormat dateFormat;
private XMLParser<Tweets> parser;

private IRule<Tweets> tweet = new DefaultRule<Tweets>(Type.TAG, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry"
) {
    public void handleTag(
        XMLParser<Tweets> aParser, boolean aIsStartTag, Tweets aUserObject) {
        if (aIsStartTag)
            aUserObject.addTweet(currentTweet = new Tweet());
    }
};

private IRule<Tweets> published = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]published"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        try {
            currentTweet.setPublished(dateFormat.parse(aText));
        } catch (ParseException anExc) {
            throw new XMLParserException("date-parsing problem", anExc);
        }
    }
};

private IRule<Tweets> title = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]title"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        currentTweet.setTitle(aText);
    }
};

private IRule<Tweets> content = new DefaultRule<Tweets>(Type.TAG, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]content"
) {
    public void handleTag(
        XMLParser<Tweets> aParser, boolean aIsStartTag, Tweets aUserObject
    ) {
        if (aIsStartTag)
            currentTweet.setContent(new Content());
        super.handleTag(aParser, aIsStartTag, aUserObject);
    }
};

private IRule<Tweets> contentType = new DefaultRule<Tweets>(Type.ATTRIBUTE, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]content", "type"
) {
    public void handleParsedAttribute(
        XMLParser<Tweets> aParser, int aIndex, String aValue, Tweets aUserObject
    ) {
        currentTweet.getContent().setType(aValue);
    }
};

private IRule<Tweets> contentText = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]content"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        currentTweet.getContent().setValue(aText);
    }
};

private IRule<Tweets> lang = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://api.twitter.com/]lang"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        currentTweet.setLanguage(aText);
    }
};

private IRule<Tweets> author = new DefaultRule<Tweets>(Type.TAG, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]author"
) {
    public void handleTag(
        XMLParser<Tweets> aParser, boolean aIsStartTag, Tweets aUserObject
    ) {
        if (aIsStartTag)
            currentTweet.setAuthor(new Author());
        super.handleTag(aParser, aIsStartTag, aUserObject);
    }
};

private IRule<Tweets> authorName = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]author" +
    "/[http://www.w3.org/2005/Atom]name"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        currentTweet.getAuthor().setName(aText);
    }
};

private IRule<Tweets> authorUri = new DefaultRule<Tweets>(Type.CHARACTER, 
    "/[http://www.w3.org/2005/Atom]feed" +
    "/[http://www.w3.org/2005/Atom]entry" +
    "/[http://www.w3.org/2005/Atom]author" +
    "/[http://www.w3.org/2005/Atom]uri"
) {
    public void handleParsedCharacters(
        XMLParser<Tweets> aParser, String aText, Tweets aUserObject
    ) {
        currentTweet.getAuthor().setUri(aText);
    }
};

@SuppressWarnings("all")
public SJXPTweetsReader() {
    dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    parser = new XMLParser<Tweets>(
        tweet, published, title, content, contentType, 
        contentText, lang, author, authorName, authorUri
    );
}

@Override
public String getParserName() {
    return "SJXP (pull)";
}

@Override
public Tweets read(InputStream anInputStream) 
throws Exception {
    Tweets _result = new Tweets();  
    parser.parse(anInputStream, "utf-8", _result);
    return _result;
}

I like the idea of SJXP, and I think that - particularly on more complex documents - it will lead to code that is easier to understand and maintain, because you can consider each part entirely separately. It bulks up with boiler-plate, though, especially with that namespace issue I mentioned.

Like SAX and "straight" Pull parsing, it also suffers the problem of having to manage intermediate state (in my sample, it's currentTweet). It does allow a state/context object to be pushed into the callback methods, so I could have passed a customised context class to manage my state instead of passing Tweets.

dsl4xml (Pull-parser wrapper)

This is my own small wrapper around XmlPullParser. The goals and reasons for it are stated at length elsewhere; suffice to say that readability without sacrificing speed was my main aim.

Dsl4xml parsing code has a declarative style, is concise, and uses reflection to cut boiler-plate to a minimum.

The actual code is on GitHub; here's what it looks like:

private DocumentReader<Tweets> reader;

public Dsl4XmlTweetsReader() {
    reader = mappingOf(Tweets.class).to(
        tag("entry", Tweet.class).with(
            tag("published"),
            tag("title"),
            tag("content", Content.class).with(
                attribute("type"),
                pcdataMappedTo("value")
            ),
            tag("twitter", "lang").
                withPCDataMappedTo("language"),
            tag("author", Author.class).with(
                tag("name"),
                tag("uri")
            )
        )
    );

    reader.registerConverters(
        new ThreadUnsafeDateConverter("yyyy-MM-dd'T'HH:mm:ss")
    );
}

@Override
public String getParserName() {
    return "DSL4XML (pull)";
}

@Override
public Tweets read(InputStream anInputStream) throws Exception {
    return reader.read(anInputStream, "utf-8");
}

There are two things I want to point out, which I guess you will have noticed already:

  1. This is by far the shortest and simplest code of all the samples shown.
  2. The code is slightly unusual in its style because it uses an Internal Domain Specific Language. The nice thing (IMHO) is that it is very readable, and even mimics the structure of the XML itself.

It's still early days for dsl4xml, so the DSL may evolve a bit with time. I'm also looking into ways to keep the same tight syntax without resorting to reflection, the aim being to narrow the performance gap between the raw underlying parser (currently a Pull parser) and dsl4xml.

Performance Comparison

I built some performance tests using the mechanisms described above to parse the same document repeatedly.

The tests are run repeatedly with increasing numbers of threads, from 1 to 8, parsing 1000 documents in each thread. The XML document is read into a byte array in memory before the test starts, to eliminate disk IO from consideration.
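The harness is roughly this shape (a simplified sketch, not the actual test code - the Reader interface and class names here are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

// Simplified benchmark loop: each thread parses the same in-memory
// bytes repeatedly, and the whole run is timed end to end.
public class Bench {
    public interface Reader { void read(InputStream in) throws Exception; }

    public static long run(byte[] xml, int threads, int docsPerThread,
                           Supplier<Reader> readers) throws Exception {
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            Reader r = readers.get(); // one reader instance per thread
            new Thread(() -> {
                try {
                    for (int i = 0; i < docsPerThread; i++) {
                        // wrap the shared bytes: no disk IO in the timed loop
                        r.read(new ByteArrayInputStream(xml));
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                } finally {
                    done.countDown();
                }
            }).start();
        }
        done.await();
        return (System.nanoTime() - start) / 1_000_000; // elapsed millis
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<a/>".getBytes("UTF-8");
        // stand-in "parser" that just consumes the stream
        long ms = run(xml, 2, 5, () -> in -> { while (in.read() != -1); });
        System.out.println("elapsed ms: " + ms);
    }
}
```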

When the statistics for each method have been collected, the test generates an HTML document that uses Google Charts to render the results.

Each parsing method is tested several times and the results averaged to smooth out some of the wilder outliers (still far from perfect, partly due to garbage collection). I ran the tests on my Linux desktop, MacBook Air, Samsung Galaxy S2, and Motorola Xoom 2 Media Edition.

Here is the chart for the desktop (Core i7 (quad) 1.8GHz, 4GB RAM, Ubuntu 11.10, Sun JDK 1.6.0-26). There is a noticeable hump at 4 threads, presumably because it's a quad-core CPU. Performance keeps rising up to 8 threads, presumably because the CPU has hyper-threading. After 8 threads the performance slowly drops off as the context-switching overhead builds up (not shown here):

And here's the chart from my MacBook Air (Core i5 (dual) 1.7GHz, 4GB RAM, OSX Lion, Apple JDK 1.6.0-31):

The difference running under Android is, to put it mildly, astonishing. Here's the chart from my Samsung Galaxy S2 running Android 2.3.4 (64MB heap). I reduced the max concurrency to 4 and the number of documents parsed per thread to 10, otherwise my phone would be obsolete before the results came back :)

Yep, SAX kicking ass right there.

Here's how it looks on a Motorola Xoom 2 Media Edition running Android 3.2.2 (with a 48MB heap):

Confirming that SAX is the way to go on Android!

Quick side note about iOS

My friend Matt Preston did a quick port of the DOM and SAX parsing tests to iOS.

He didn't produce a chart (yet!), but DOM parsing throughput on an iPhone 4S was approximately twice as good as SAX parsing on my Samsung. SAX parsing on the iPhone churned through an average of 150 docs/sec!

It's interesting to note that the iPhone 4S runs a 1GHz Cortex A9 CPU clocked down to 800MHz, while my Samsung runs a 1.2GHz Cortex A9.

Why XPath parsing sucked so bad

The observant will have noticed the charts do not contain figures for the XPath parsing. That's because I dropped it when I realised it was two orders of magnitude slower even than DOM parsing.

This appalling performance seems to be because executing each XPath expression creates a context object, which involves looking up several files on the classpath (with all the inherent synchronisation this entails). I don't intend to waste my time digging into why this can't be done once and cached :(.

If you're interested, this is what my threads spent most of their time doing in the XPath test:

"Thread-11" prio=5 tid=7fcf544d2000 nid=0x10d6bb000 
    waiting for monitor entry [10d6b9000]
    java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.zip.ZipFile.getEntry(ZipFile.java:159)
    - locked <7f4514c88> (a java.util.jar.JarFile)
    at java.util.jar.JarFile.getEntry(JarFile.java:208)
    at java.util.jar.JarFile.getJarEntry(JarFile.java:191)
    at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:757)
    at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:735)
    at sun.misc.URLClassPath.findResource(URLClassPath.java:146)
    at java.net.URLClassLoader$2.run(URLClassLoader.java:385)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findResource(URLClassLoader.java:382)
    at java.lang.ClassLoader.getResource(ClassLoader.java:1002)
    at java.lang.ClassLoader.getResource(ClassLoader.java:997)
    at java.lang.ClassLoader.getSystemResource(ClassLoader.java:1100)
    at java.lang.ClassLoader.getSystemResourceAsStream(ClassLoader.java:1214)
    at com.sun.org.apache.xml.internal.dtm.SecuritySupport12$6.run
        (SecuritySupport12.java:117)
    at java.security.AccessController.doPrivileged(Native Method)
    at
    com.sun.org.apache.xml.internal.dtm.SecuritySupport12.
        getResourceAsStream(SecuritySupport12.java:112)
    at com.sun.org.apache.xml.internal.dtm.ObjectFactory.
        findJarServiceProviderName(ObjectFactory.java:549)
    at com.sun.org.apache.xml.internal.dtm.ObjectFactory.
        lookUpFactoryClassName(ObjectFactory.java:373)
    at com.sun.org.apache.xml.internal.dtm.ObjectFactory.
        lookUpFactoryClass(ObjectFactory.java:206)
    at com.sun.org.apache.xml.internal.dtm.ObjectFactory.
        createObject(ObjectFactory.java:131)
    at com.sun.org.apache.xml.internal.dtm.ObjectFactory.
        createObject(ObjectFactory.java:101)
    at com.sun.org.apache.xml.internal.dtm.DTMManager.
        newInstance(DTMManager.java:135)
    at com.sun.org.apache.xpath.internal.XPathContext.
        <init>(XPathContext.java:100)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.
        eval(XPathExpressionImpl.java:110)

Conclusions

Readability

Of the mechanisms tested so far, and from the code samples above, I think that dsl4xml produces by far the most readable and maintainable parsing code. Of course, I am biased.

I think SAX parsing would have worked out to be the most readable of the other mechanisms if it hadn't been for those pesky entity-refs. As it is, I have to recommend Pull-parsing as the way to go for readability.

Desktop/laptop xml parsing performance

SAX parsing and the pull-parsing wrappers give comparable performance. Raw Pull-parsing beats the lot by a margin of around 15%. DOM performs relatively badly - around twice as slow as any of the others. Don't go near XPath-based parsing unless you like watching paint dry.

Recommendation: Pull Parser for max performance and relative ease of use. Dsl4xml if you want performance and great readability :)

Android xml parsing performance

Avoid XPath at all costs. DOM and pull-parsing appear to have similarly poor performance characteristics. SAX absolutely destroys all the others - roughly an order of magnitude quicker.

Recommendation: SAX, every time. I'll get working on a SAX-based dsl4xml implementation :)

Update (23rd April 2012): Just finished a SAX-based dsl4xml - here's the performance chart for my Samsung Galaxy SII again (also includes figures for SimpleXML):

Final words

The Twitter Atom feed is not particularly complicated - tags are not deeply nested, not too many attributes, no nested tags of the same name, no mixed content (tags and text-nodes as siblings), etc.

I suspect that the performance gap between the different mechanisms widens as the document complexity increases, but as yet have no real evidence to back that up.
