April 05, 2012

DSL for XML parsing in Android

For a readability and performance comparison of different parsing mechanisms available in Android, have a look at my more recent post that compares parsing a Twitter search result using DOM, SAX, and various Pull parsing methods.

The short story: Always use SAX in Android (here's why).

SAX and pull parsing are fast, but don't lead to the most readable or maintainable code. Instead, how about a super-simple internal DSL for describing the mapping used to unmarshal XML to POJOs, with pull-parsing performance? Quick example:

<books>
  <book>
    <title>The Hobbit</title>
    <synopsis>
        A little guy goes on an adventure, 
        finds ring, comes back.
    </synopsis>
  </book>
  <book>
    <title>The Lord of the Rings</title>
    <synopsis>
        A couple of little guys go on an adventure, 
        lose ring, come back.
    </synopsis>
  </book>
</books>

This can be unmarshalled to simple POJOs with:

import static com.sjl.dsl4xml.DocumentReader.*;

class BooksReader {
    private DocumentReader<Books> reader;

    public BooksReader() {
        reader = mappingOf(Books.class).to(
           tag("book", Book.class).with(
               tag("title"),
               tag("synopsis")
           )
        );
    }

    public Books read(Reader aReader) {
        return reader.read(aReader);
    }
}

The long story:

I recently had occasion to work on an Android app that was suffering horrible performance problems on startup (approx 7-12 seconds before displaying content).

A look at the code showed up several possible contenders for the source of the problem:

Many concurrent HTTP requests to fetch XML content from the web
Parsing the returned XML documents concurrently
Parsing documents using DOM (and some XPath)

A quick run through with the excellent profiler built in to DDMS immediately showed lots of time spent in DOM methods, and masses of heap being consumed by sparsely populated java.util.List objects (used to represent the DOM in memory).

Since the app was subsequently discarding the parsed Document object, the large heap consumption was contributing a huge garbage-collection load as a side-effect.

Parsing many documents at once meant that the app suffered a perfect storm of exacerbating issues: slow DOM traversal with XPath, constant thread context-switching, massive heap consumption, and huge object churn.

The network requests - even over 3G - were comparatively insignificant in the grand scheme.

Reducing thread context switching

An obvious and inexpensive thing to try at this point was reducing the concurrency to minimise the overhead of context-switching and hopefully enable the CPU caches to be used to best advantage.
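For illustration, here is a minimal sketch of what capping the concurrency can look like: the parse tasks go through a fixed-size thread pool, so no more than a chosen number run at once. The class name, the `parse` stand-in and the pool size are all hypothetical (the app's real code isn't shown in this post), and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedParsing {
    // Hypothetical stand-in for "parse one XML document"; it only measures the length.
    static int parse(String document) {
        return document.length();
    }

    // Runs one parse task per document, but never more than maxThreads at the same time.
    static List<Integer> parseAll(List<String> documents, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (final String doc : documents) {
                tasks.add(() -> parse(doc));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task is done and returns futures in task order.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With a pool size of 1 or 2 the documents are parsed more or less one after another, which is the kind of reduction described above.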

I confess I hoped for a significant improvement from this small change, but the difference, while measurable, was too small to be significant (~5-10%).

More efficient parsing

XPath is easy to use, and typically makes it possible to write straightforward code for marshalling data from an XML document into Java objects. It is, however, horribly slow and a terrible memory hog.
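To show why the DOM-plus-XPath style is tempting, here is a small sketch that pulls the titles out of the books document from the top of this post using the standard JAXP APIs. It is not the app's code; the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTitles {
    static List<String> titles(String xml) {
        try {
            // The whole document tree is built in memory before any query runs.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "/books/book/title", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The code is short and reads well, but every call pays for a full in-memory tree that is thrown away afterwards.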

I decided to try an experiment with an alternative parsing method, to see if a worthwhile performance gain could be achieved on one of the smaller documents that could then be applied to others.

I wrote a small test-case confirming the correctness of the existing parsing mechanism and testing the throughput in documents per second, then extracted an interface and created a new implementation that used Pull-Parsing instead of DOM and XPath.
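A rough sketch of that kind of harness is below. The `DocumentParser` interface and both helper methods are hypothetical names invented for this example, and the timing is deliberately naive (no warm-up, a single run), so treat the numbers it produces as a rough guide only.

```java
class ThroughputCheck {
    // Hypothetical interface extracted so the old and new parsers can be tested the same way.
    interface DocumentParser<T> {
        T parse(String xml);
    }

    // Parses the same document repeatedly and returns the rate in documents per second.
    static <T> double documentsPerSecond(DocumentParser<T> parser, String xml, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.parse(xml);
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return iterations * 1_000_000_000.0 / elapsedNanos;
    }

    // Correctness check: the candidate must produce the same result as the existing parser.
    static <T> boolean agree(DocumentParser<T> existing, DocumentParser<T> candidate, String xml) {
        return existing.parse(xml).equals(candidate.parse(xml));
    }
}
```

Running `agree` first and `documentsPerSecond` second for each implementation keeps a fast but wrong parser from looking like a win.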

The result was quite pleasing: 5x faster on a simple document. I fully expected the performance gains to be even better on more complex documents, so was quite eager to repeat the process for one of the most complex documents.

However, I had one major concern that put me off: the code for parsing even a simple document was already quite long and had a nasty whiff of conditional-overkill (think: lots of if statements). I wasn't too happy about trading code readability for performance.
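To give a feel for the conditional overkill, here is a hand-written pull parser for the small books document from the top of this post. It uses StAX (`javax.xml.stream`) as a desktop-Java stand-in for Android's `XmlPullParser`, so the API is not the one the app used, and the `Book` class is a hypothetical model invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class HandWrittenPullParser {
    // Hypothetical model class for the example.
    static class Book {
        String title;
        String synopsis;
    }

    static List<Book> parse(String xml) {
        List<Book> books = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            Book current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("book".equals(name)) {
                        current = new Book();
                    } else if ("title".equals(name) && current != null) {
                        // getElementText() reads up to the matching end tag.
                        current.title = reader.getElementText().trim();
                    } else if ("synopsis".equals(name) && current != null) {
                        current.synopsis = reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if ("book".equals(reader.getLocalName()) && current != null) {
                        books.add(current);
                        current = null;
                    }
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return books;
    }
}
```

Two fields already need a handful of branches; every extra element or level of nesting adds more.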

I pondered a few alternatives: XStream, which I've used a lot for converting from Java to XML but not much the other way around, and SimpleXML, which I have used previously and can be nice, but which pollutes your model objects with annotations and in some situations can be a real pain to get working.

An Internal DSL for mapping XML to POJOs

In the end I decided to spend just a few hours turning the problem over in code to see if I could come up with something more readable for working with the pull-parser directly.

The result, after an afternoon of attempting to parse the most complex XML file the app consumed, was a small Internal DSL (Domain Specific Language) for declaratively describing the mapping between an XML document and the Java model classes, and a 15x performance improvement in startup time for the app (7-12 seconds down to ~0.5s).

The DSL I originally came up with required some boilerplate code to do the final mapping between text nodes / attributes and the model classes being populated. If Java had a neat syntax for closures this would have been much less irritating :)

As it was, the boilerplate irked me: too much stuff getting in the way of reading what was really important. I thought about it a bit in my spare time, and had another shot at it. My aims were:

To make readable, maintainable, declarative code that unmarshalls XML documents to Java objects.
To make unmarshalling XML documents to Java objects very fast (SAX/pull-parsing speeds).
To avoid polluting model classes with metadata about XML parsing (no annotations).
To avoid additional build-time steps or "untouchable" code (code generators, etc).
To produce a very small jar with no large dependencies.

The result is starting to take shape on GitHub as dsl4xml. It removes all of the boilerplate in exchange for a small performance penalty due to use of reflection. I don't have comparative performance figures yet, but will post some when I get time.
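To illustrate the general idea of trading boilerplate for reflection, here is a minimal sketch that maps a tag name to a JavaBean-style setter and invokes it. This is not dsl4xml's actual implementation; the class, the `set` helper and the naming convention it relies on are assumptions made for the example.

```java
import java.lang.reflect.Method;

class ReflectiveSetter {
    // Hypothetical model class, free of any XML-related annotations.
    public static class Book {
        private String title;
        public void setTitle(String aTitle) { title = aTitle; }
        public String getTitle() { return title; }
    }

    // Maps a tag name such as "title" to a setter such as setTitle(String) and invokes it.
    static void set(Object target, String tagName, String value) {
        String setter = "set" + Character.toUpperCase(tagName.charAt(0)) + tagName.substring(1);
        try {
            Method method = target.getClass().getMethod(setter, String.class);
            method.invoke(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No usable setter for tag: " + tagName, e);
        }
    }
}
```

The method lookup and reflective call on every value are where the small performance penalty comes from; caching the `Method` objects per class is one common way to reduce it.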

Another example

XML:

<hobbit>
  <name firstname="Frodo" surname="Baggins"/>
  <dob>11400930</dob>
  <address>
    <house>
      <name>Bag End</name>
      <number></number>
    </house>
    <street>Bagshot Row</street>
    <town>Hobbiton</town>
    <country>The Shire</country>
  </address>
</hobbit>

POJOs: see the source code of the test case

Unmarshalling code:

private static DocumentReader<Hobbit> newReader() {
    DocumentReader<Hobbit> _reader = mappingOf(Hobbit.class).to(
        tag("name", Name.class).with(
            attributes("firstname", "surname")
        ),
        tag("dob"),
        tag("address", Address.class).with(
            tag("house", Address.House.class).with(
                tag("name"),
                tag("number")
            ),
            tag("street"),
            tag("town"),
            tag("country")
        )
    );

    _reader.registerConverters(new ThreadUnsafeDateConverter("yyyyMMdd"));

    return _reader;
}

A DocumentReader, once constructed, is intended to be re-used repeatedly. The DocumentReader itself is thread-safe, as unmarshalling does not modify any of its internal state, provided that every registered type converter is also thread-safe (see the type conversion section below).

A minimum of garbage is generated because we're using a pull parser to skip over parts of the document we don't care about, and the only state maintained along the way (in a single-use context object for thread safety) is the domain objects we're creating.

Type conversion

You can create and register your own type converters. They are used only to map the lowest-level XML data to your Java objects: attribute values and character data Strings. The Converter interface looks like this:

package com.sjl.dsl4xml.support;

public interface Converter<T> {
    public boolean canConvertTo(Class<?> aClass);
    public T convert(String aValue);
}

An example Converter for converting String values to primitive ints looks like this:

class PrimitiveIntConverter implements Converter<Integer> {
    @Override
    public boolean canConvertTo(Class<?> aClass) {
        return aClass.isAssignableFrom(Integer.TYPE);
    }

    @Override
    public Integer convert(String aValue) {
        return ((aValue == null) || ("".equals(aValue))) ?
            0 : Integer.valueOf(aValue);
    }
}

Most converters can be thread-safe, but some may require concurrency control for multi-threaded use (example: when converting dates using SimpleDateFormat).
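As an example of a converter that is thread-safe simply because it holds no mutable state, here is a sketch of a BigDecimal converter. It is not part of dsl4xml as described in this post; the `Converter` interface is copied locally from the listing above so that the example compiles on its own, and the wrapper class name is made up.

```java
import java.math.BigDecimal;

class BigDecimalConverterSketch {
    // Local copy of the Converter interface shown above, so the sketch is self-contained.
    interface Converter<T> {
        boolean canConvertTo(Class<?> aClass);
        T convert(String aValue);
    }

    // Holds no fields at all, so a single instance can be shared between threads.
    static class BigDecimalConverter implements Converter<BigDecimal> {
        @Override
        public boolean canConvertTo(Class<?> aClass) {
            return aClass.isAssignableFrom(BigDecimal.class);
        }

        @Override
        public BigDecimal convert(String aValue) {
            // Empty or missing values map to null rather than throwing.
            return ((aValue == null) || ("".equals(aValue.trim()))) ?
                null : new BigDecimal(aValue.trim());
        }
    }
}
```

Converters only need extra care when they wrap something stateful, which is the situation with date parsing.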

You can use optimised type converters in situations where you know you will not be unmarshalling from multiple threads concurrently. An example is the ThreadUnsafeDateConverter which is used in the example above because it came from a test-case that will only ever run single-threaded.

public class ThreadUnsafeDateConverter implements Converter<Date> {
    private DateFormat dateFormat;

    public ThreadUnsafeDateConverter(String aDateFormatPattern) {
        // SimpleDateFormat is NOT thread-safe
        dateFormat = new SimpleDateFormat(aDateFormatPattern);
    }

    @Override
    public boolean canConvertTo(Class<?> aClass) {
        return aClass.isAssignableFrom(Date.class);
    }

    @Override
    public Date convert(String aValue) {
        try {
            return ((aValue == null) || ("".equals(aValue))) ? 
                null : dateFormat.parse(aValue);
        } catch (ParseException anExc) {
            throw new XmlMarshallingException(anExc);
        }
    }
}

The alternative ThreadSafeDateConverter looks like this:

class ThreadSafeDateConverter implements Converter<Date> {
    private ThreadLocal<DateFormat> dateFormat;

    public ThreadSafeDateConverter(final String aDateFormatPattern) {
        dateFormat = new ThreadLocal<DateFormat>() {
            protected DateFormat initialValue() {
                return new SimpleDateFormat(aDateFormatPattern);
            }
        };
    }

    @Override
    public boolean canConvertTo(Class<?> aClass) {
        return aClass.isAssignableFrom(Date.class);
    }

    @Override
    public Date convert(String aValue) {
        try {
            return ((aValue == null) || ("".equals(aValue))) ? 
                null : dateFormat.get().parse(aValue);
        } catch (ParseException anExc) {
            throw new XmlMarshallingException(anExc);
        }
    }
}

Missing features

This is still a very new project, and in an experimental stage. There's loads still to do:

Experiment with more documents to drive improvements to the DSL
More converters for the obvious types (e.g., BigDecimal, BigInteger, File, URI, etc.)
Support for namespaced documents
Support for CDATA (so far only tested with PCDATA)
Performance comparisons with DOM, SAX and non-DSL'd Pull parsing
Support for explicit (non-reflective) marshalling of properties
Support for SAX parsing instead of Pull-Parsing (see notes below)
Performance tests
Performance optimisations

Notes

I came across some interesting comments by Dianne Hackborn (Android platform developer) in this thread.

Dianne points out that SAX parsing is faster than pull parsing (at least on Android). I had been under the impression it was the other way around, hence I went with pull parsing.

Later perf tests show SAX to be much faster on Android, so I will probably refactor to use SAX.
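For comparison, here is a minimal sketch of the same title extraction done with SAX, using the parser from standard Java's JAXP. It only illustrates the callback style a SAX-based version would have to build on; it is not dsl4xml code, and the class and method names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitles {
    static List<String> titles(String xml) {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <title> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append rather than assign.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName) && text != null) {
                    titles.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return titles;
    }
}
```

Unlike pull parsing, the parser drives the handler here, so the state that a pull loop keeps in local variables has to live in fields of the handler.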

[Chart: Android parser performance]
