Andreas Haufler is the software architect and one of the founders of scireum GmbH, located near Stuttgart (Germany). He holds a Diploma in Computer Science (Software Engineering) from the University of Stuttgart. He has over ten years of experience with Java, HTML and related web technologies. Being a "language hobbyist", he also enjoys the power of less popular but definitely more beautiful languages like Smalltalk and Lisp(s).

Conveniently Processing Large XML Files with Java

01.10.2012

When processing XML data, it's usually most convenient to load the whole document using a DOM parser and fire XPath queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files with sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since its DOM representation easily grows to 3 GB or more.

So what to do? Well, SAX to the rescue! Processing a large XML file with a SAX parser requires only constant (low) memory, since it just invokes callbacks for the XML tokens it detects. On the other hand, parsing complex XML this way quickly becomes a mess.
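To illustrate why: even pulling a single value out of a document with plain SAX means tracking all parser state by hand. The following is a minimal sketch using only the standard javax.xml and org.xml.sax APIs (the class and method names here are illustrative, not part of the library presented below):

```java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {

    /**
     * Extracts the text of the first <name> element, SAX-style: the parser
     * only fires callbacks, so all state ("are we inside <name>?") must be
     * tracked manually. For deeply nested XML this bookkeeping explodes.
     */
    public static String extractName(String xml) throws Exception {
        StringBuilder name = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inName;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Only record the first <name> we encounter.
                if ("name".equals(qName) && name.length() == 0) {
                    inName = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (inName) {
                    name.append(ch, start, len);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                inName = false;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return name.toString();
    }
}
```

Memory stays constant no matter how large the input is, but every additional element of interest adds another flag and another branch to the handler.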

To resolve this problem we need a closer look at our XML input data. Most of the time, at least in our case, you don't need the whole DOM at once. Say you're importing product information: it is sufficient to look at one product at a time. Example:

<nodes>
    <node>
        <name>Node 1</name>
        <price>100</price>
    </node>
    <node>
        <name>Node 2</name>
        <price>23</price>
    </node>
    <node>
        <name>Node 3</name>
        <price>12.4</price>
        <resources>
            <resource type="test">Hello 1</resource>
            <resource type="test1">Hello 2</resource>
        </resources>
    </node>
</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or 3; respectively, when processing Node 2, we don't need access to Node 1 or 3, and so on. So what we want is a partial DOM, in our example one for every <node>.


What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record the whole subtree. When it completes, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
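The recording idea itself can be sketched with nothing but the standard SAX and org.w3c.dom APIs. The class below is a hypothetical illustration of the mechanism, not the actual scireumOpen implementation linked further down; it builds a detached DOM subtree for each matching element and hands it to a callback:

```java
import java.io.InputStream;
import java.util.function.Consumer;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PartialDomParser {

    /**
     * Parses the stream with SAX and invokes the handler with a detached
     * DOM subtree for every element named "target". Only one subtree is
     * ever held in memory; it becomes garbage as soon as the handler returns.
     */
    public static void parse(InputStream in, String target, Consumer<Element> handler)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        DefaultHandler sax = new DefaultHandler() {
            private Node current; // node we are appending to; null = not recording

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (current == null && !qName.equals(target)) {
                    return; // skip everything until the next target element
                }
                Element el = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    el.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                if (current != null) {
                    current.appendChild(el);
                }
                current = el;
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (current != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, len)));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (current == null) {
                    return;
                }
                Node parent = current.getParentNode();
                if (parent == null) {
                    // The recorded subtree is complete: notify the handler.
                    handler.accept((Element) current);
                }
                current = parent; // pop; null again once the subtree is released
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, sax);
    }
}
```

The real library additionally offers XPath-style queries on the recorded subtree; this sketch only delivers the raw DOM element.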

Here is a shortened example of how you could parse the XML above - one "<node>" at a time:
XMLReader r = new XMLReader();
r.addHandler("node", new NodeHandler() {

    @Override
    public void process(StructuredNode node) {
        System.out.println(node.queryString("name"));
        System.out.println(node.queryValue("price").asDouble(0d));
    }
});

r.parse(new FileInputStream("src/examples/test.xml"));

The full example, along with the implementation, is open source (MIT license) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java

We successfully handle up to five parallel imports of 1 GB+ XML files in our production system without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities and directly open and process ZIP versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
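Reading the XML straight out of a ZIP is plain java.util.zip; in practice you would hand the ZipInputStream directly to the SAX parser so the data is decompressed on the fly. The helper below (a sketch with hypothetical names, not the production code) extracts the first .xml entry into a String just to demonstrate the stream handling:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipXmlSource {

    /**
     * Returns the text of the first .xml entry in the given ZIP stream,
     * or null if none is found. In a real import you would pass the
     * ZipInputStream straight to the parser instead of buffering it.
     */
    public static String readFirstXmlEntry(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".xml")) {
                    continue;
                }
                // Drain the (decompressed) entry; the ZIP itself stays on disk.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) > 0) {
                    buf.write(chunk, 0, n);
                }
                return buf.toString("UTF-8");
            }
        }
        return null;
    }
}
```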
Published at DZone with permission of its author, Andreas Haufler.


Comments

Tom Eugelink replied on Tue, 2012/01/10 - 9:39am

This is not a bad idea. To prevent simple mistakes, it may be wise to also allow specifying an absolute path leading up to the node, but I really like the idea. r.addHandler("/root/some/path/to/the/node", new NodeHandler() {

Greg Brown replied on Tue, 2012/01/10 - 10:04am

Interesting idea. By the way, if you like SAX, I'd recommend taking a look at StAX. It has all of the benefits of SAX but (IMO) is much easier to work with:

http://en.wikipedia.org/wiki/StAX
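For comparison, here is a minimal StAX sketch (standard javax.xml.stream APIs; the class and method names are illustrative) that sums the <price> values from the example above. The application pulls events in a loop instead of receiving callbacks, which keeps the control flow in the caller's hands:

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {

    /**
     * Sums all <price> element values. Unlike SAX, the cursor is advanced
     * explicitly with next(), so no handler state machine is needed.
     */
    public static double sumPrices(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        double sum = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "price".equals(r.getLocalName())) {
                // getElementText() reads the text content and advances
                // the cursor to the matching end element.
                sum += Double.parseDouble(r.getElementText());
            }
        }
        return sum;
    }
}
```

Memory usage is constant here as well, since only the current event is held, never a tree.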

 

Andreas Haufler replied on Tue, 2012/01/10 - 4:08pm

Greg, thanks for your feedback. I've already read about StAX but haven't had a detailed look at it. Does anybody know whether there's a performance benefit to using StAX over SAX, or is the API "just" simpler/better?

Tom, good idea. We use this mainly to import article descriptions into an e-commerce system. These files are mostly long lists of categories and articles, with more or less complex XML inside each article, so the node name was good enough for us. Still, for other use cases this might be a good improvement. Also, our approach currently ignores XML namespaces, since implementing that would have been more overhead than benefit.

regards
Andy

Arnaud Des_vosges replied on Thu, 2012/01/12 - 11:54am

See the SAXDOMIX library.

Manuel Eveno replied on Tue, 2012/03/27 - 2:43am

As you asked about a performance comparison, I found this article (referenced in the Java Performance Tuning newsletter, javaperformancetuning.com):

http://dublintech.blogspot.fr/2011/12/jaxb-sax-dom-performance.html

 

Alex Giotis replied on Wed, 2012/03/28 - 5:13am

Dom4j offers a similar implementation that I have used successfully in many projects. See http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

Manoj Babu replied on Wed, 2012/08/15 - 8:02am

Hi Andy,

As you mentioned, we can specify in the handler which XML elements we are interested in, and once such an element starts, it records the whole sub-tree. Suppose I am interested in multiple elements. What should I do?

Should I add multiple handlers? Will that perform well?

Lolke Dijkstra replied on Tue, 2012/11/20 - 7:48am

Yup, it is an interesting approach. I (also) recently published an article here that deals with processing big XML data in Java:

http://java.dzone.com/articles/framework-and-code-generator

You may want to have a look at it. It also utilizes SAX, but uses code generation to generate JavaBeans access to the schema's complex types. It deals with large datasets by allowing the application programmer to configure at runtime which parts to process. It also deals with memory issues like containers (e.g. the <nodes> element above).

We've got an evaluation version available for anyone who is interested in checking it out.


Michael Onoprienko replied on Fri, 2012/11/30 - 7:05pm

I've recently written exactly the same kind of XML parser. The difference is the ability to use various DOM libraries (native, JDOM, dom4j, XOM). Take a look and share your thoughts:

https://github.com/nanotears/solna-xml

Lolke Dijkstra replied on Sat, 2012/12/01 - 5:04am

I like this approach. It's elegant and straightforward. However, as you said, it does not deal with mapping to POJOs and is slower due to the DOM tree construction. Nevertheless, it can be quite a valid approach for not-too-complex XML documents. I would certainly appreciate it if you could take a moment to have a look at: http://dijkstra-ict.nl/xml-java-data-mapping-big-data-article.html. Any thoughts/comments are very much appreciated.

You can also have a look here: http://java.dzone.com/articles/framework-and-code-generator

It's almost identical.
