
Memory Efficient XML Processing not only with DOM

05.01.2010

How can I efficiently parse large XML files that can be several GB in size? With SAX? Hmm, well, yes, you can! But this is somewhat ugly. If you prefer a more maintainable approach, you should definitely try Joost, which does not load the entire XML file into memory but is quite similar to XSLT.
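To make the "somewhat ugly" part concrete, here is a minimal sketch of the plain SAX approach. The class name `PlainSaxSketch`, the nested `name` element and the sample data are assumptions made for illustration only. Note how every piece of parser state has to be tracked by hand:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PlainSaxSketch {

    /** Collects "id=name" pairs; the handler has to remember where it is. */
    static class ProductHandler extends DefaultHandler {
        final List<String> result = new ArrayList<String>();
        private String currentId;        // id of the enclosing <product>, if any
        private StringBuilder nameText;  // non-null only while inside <name>

        @Override
        public void startElement(String uri, String local, String qName,
                Attributes attrs) {
            if ("product".equals(qName)) {
                currentId = attrs.getValue("id");
            } else if ("name".equals(qName) && currentId != null) {
                nameText = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] buf, int offset, int length) {
            // SAX may deliver one text node in several chunks
            if (nameText != null)
                nameText.append(buf, offset, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("name".equals(qName) && nameText != null) {
                result.add(currentId + "=" + nameText.toString().trim());
                nameText = null;
            } else if ("product".equals(qName)) {
                currentId = null;
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        ProductHandler handler = new ProductHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products>"
                + "<product id=\"1\"><name>Foo</name></product>"
                + "<product id=\"2\"><name>Bar</name></product>"
                + "</products>";
        System.out.println(parse(xml)); // prints [1=Foo, 2=Bar]
    }
}
```

Each additional field you want to extract adds another flag or buffer to the handler, which is why this style gets hard to maintain quickly.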

But how can I do this with DOM, or even better dom4j, if I only have 50 MB of RAM or even less? Well, this is not always possible, but under some circumstances you can do it with a small helper class. Read on!

E.g. you have the XML file:

<products>
<product id="1"> CONTENT1 .. </product>
<product id="2"> CONTENT2 .. </product>
<product id="3"> CONTENT3 .. </product>
...
</products>

Then you can parse it product by product via:

// must be final so that the anonymous class below can access it
final List<String> idList = new ArrayList<String>();
ContentHandler productHandler =
    new GenericXDOMHandler("/products/product") {
        @Override
        public void writeDocument(String localName, Element element)
                throws Exception {
            // use DOM here
            String id = element.getAttribute("id");
            idList.add(id);
        }
    };
GenericXDOMHandler.execute(new File(inputFile), productHandler);

How does this work? Every time the SAX handler encounters a <product> element, it reads that product's subtree (which is quite small) into RAM and, once the element is closed, calls the writeDocument method. Technically, we have registered a listener for all product elements and are waiting for ‘events’ from our GenericXDOMHandler. The code was developed for my xvantage project but is also used in production code on big files:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
* License: http://en.wikipedia.org/wiki/Public_domain
* This software comes without WARRANTY about anything! Use it at your own risk!
*
* Reads an xml via sax and creates an Element object per document.
*
* @author Peter Karich, peathal 'at' yahoo 'dot' de
*/
public abstract class GenericXDOMHandler extends DefaultHandler {

private Document factory;
private Element current;
private List<String> rootPath;
private int depth = 0;

public GenericXDOMHandler(String forEachDocument) {
rootPath = new ArrayList<String>();
for (String str : forEachDocument.split("/")) {
str = str.trim();
if (str.length() > 0)
rootPath.add(str);
}

if (rootPath.size() < 2)
throw new UnsupportedOperationException("forEachDocument"
+ " must have at least one sub element in it."
+ " E.g. /root/subPath but it was:" + rootPath);
}

@Override
public void startDocument() throws SAXException {
try {
factory = DocumentBuilderFactory.newInstance().
newDocumentBuilder().newDocument();
} catch (Exception e) {
throw new RuntimeException("can't get DOM factory", e);
}
}

@Override
public void startElement(String uri, String local,
String qName, Attributes attrs) throws SAXException {

// go further only if we add something to our sub tree (defined by rootPath)
if (depth + 1 < rootPath.size()) {
current = null;
if (rootPath.get(depth).equals(local))
depth++;

return;
} else if (depth + 1 == rootPath.size()) {
if (!rootPath.get(depth).equals(local))
return;
}

if (current == null) {
// start a new subtree
current = factory.createElement(local);
} else {
Element childElement = factory.createElement(local);
current.appendChild(childElement);
current = childElement;
}

depth++;

// Add every attribute.
for (int i = 0; i < attrs.getLength(); ++i) {
String nsUri = attrs.getURI(i);
String qname = attrs.getQName(i);
String value = attrs.getValue(i);
Attr attr = factory.createAttributeNS(nsUri, qname);
attr.setValue(value);
current.setAttributeNodeNS(attr);
}
}

@Override
public void endElement(String uri, String localName,
String qName) throws SAXException {

if (current == null)
return;

Node parent = current.getParentNode();

// root of the subtree (it has no parent): merge adjacent text nodes
if (parent == null)
current.normalize();

if (depth == rootPath.size()) {
try {
writeDocument(localName, current);
} catch (Exception ex) {
throw new RuntimeException("Exception"
+ " while writing one element of path:" + rootPath, ex);
}
}

// climb up one level
current = (Element) parent;
depth--;
}

@Override
public void characters(char buf[], int offset, int length)
throws SAXException {
if (current != null)
current.appendChild(factory.createTextNode(
new String(buf, offset, length)));
}

public abstract void writeDocument(String localName, Element element)
throws Exception;

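public static void execute(File inputFile,
ContentHandler handler)
throws SAXException, FileNotFoundException, IOException {

// close the file stream even if parsing fails
InputStream input = new FileInputStream(inputFile);
try {
execute(input, handler);
} finally {
input.close();
}
}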

public static void execute(InputStream input,
ContentHandler handler)
throws SAXException, FileNotFoundException, IOException {

XMLReader xr = XMLReaderFactory.createXMLReader();
xr.setContentHandler(handler);
InputSource iSource = new InputSource(new InputStreamReader(input, "UTF-8"));
xr.parse(iSource);
}
}

PS: It should be simple to adapt this class to your needs, e.g. using dom4j instead of DOM. You could even register several paths, not only one rootPath, via a BindingTree. For an implementation of this, look at my xvantage project.
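As a rough illustration of the "several paths" idea, here is a minimal sketch that maps absolute paths to callbacks. It is not the BindingTree from xvantage; the class `MultiPathSketch`, the `SubtreeListener` interface and the sample element names are hypothetical, and the sketch assumes Java 8 and non-namespaced XML:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MultiPathSketch extends DefaultHandler {

    /** Hypothetical callback type, not part of the original class. */
    public interface SubtreeListener {
        void onSubtree(Element element) throws Exception;
    }

    private final Map<String, SubtreeListener> listeners = new HashMap<>();
    private final StringBuilder path = new StringBuilder(); // e.g. "/shop/products"
    private Document factory;
    private Element current;        // node being built, null outside a subtree
    private SubtreeListener active; // listener of the subtree being built

    public void register(String absolutePath, SubtreeListener listener) {
        listeners.put(absolutePath, listener);
    }

    @Override
    public void startDocument() throws SAXException {
        try {
            factory = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new SAXException("can't create DOM document", e);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes attrs) {
        path.append('/').append(qName);
        if (current == null) {
            active = listeners.get(path.toString());
            if (active == null)
                return; // not inside a registered subtree
            current = factory.createElement(qName);
        } else {
            Element child = factory.createElement(qName);
            current.appendChild(child);
            current = child;
        }
        for (int i = 0; i < attrs.getLength(); i++)
            current.setAttribute(attrs.getQName(i), attrs.getValue(i));
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (current != null)
            current.appendChild(
                    factory.createTextNode(new String(buf, offset, length)));
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        path.setLength(path.length() - qName.length() - 1);
        if (current == null)
            return;
        Node parent = current.getParentNode();
        if (parent == null) { // subtree root closed: hand it to its listener
            current.normalize();
            try {
                active.onSubtree(current);
            } catch (Exception e) {
                throw new SAXException(e);
            }
            active = null;
        }
        current = (Element) parent;
    }

    static List<String> demo(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        MultiPathSketch handler = new MultiPathSketch();
        handler.register("/shop/products/product",
                e -> seen.add("product " + e.getAttribute("id")));
        handler.register("/shop/vendors/vendor",
                e -> seen.add("vendor " + e.getAttribute("name")));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }
}
```

Because the full current path is tracked, an element only starts a subtree when its complete path matches a registered one. A registered path nested inside another registered subtree is simply built as part of the outer subtree.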

PPS: If you want to evaluate XPath expressions in the writeDocument method, make sure that the ordinary XPath engine does not become a performance bottleneck, because the method may be called many times. In my case I had several thousand documents, and Jaxen solved this problem!
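If you stay with the standard javax.xml.xpath API rather than Jaxen, one thing that may help is compiling the expression once and reusing it for every document. The following is only a sketch of that idea, not a measurement; the class name `PrecompiledXPathSketch`, the expression `name` and the sample XML are assumptions:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class PrecompiledXPathSketch {

    // compiled once; XPathExpression objects are not thread-safe,
    // so use one instance of this class per thread
    private final XPathExpression nameExpr;

    public PrecompiledXPathSketch() throws Exception {
        nameExpr = XPathFactory.newInstance().newXPath().compile("name");
    }

    /** Would be called from writeDocument for every product element. */
    String nameOf(Element product) throws Exception {
        // evaluates relative to the given element, returns its string value
        return nameExpr.evaluate(product);
    }

    // helper for the demo only: parses a small XML string into an Element
    static Element parseProduct(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        PrecompiledXPathSketch sketch = new PrecompiledXPathSketch();
        String[] docs = {
            "<product id=\"1\"><name>Foo</name></product>",
            "<product id=\"2\"><name>Bar</name></product>"
        };
        for (String doc : docs)
            System.out.println(sketch.nameOf(parseProduct(doc)));
    }
}
```

Whether this is fast enough depends on your data and JDK, so measure before and after with your own documents.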

PPPS: If you want to handle XML writing and reading (‘XML serialization’) from Java classes, check this list out!

 

From http://karussell.wordpress.com/2010/04/29/memory-efficient-xml-processing-not-only-with-dom/

Published at DZone with permission of its author, Peter Karussell.


Comments

Alois Cochard replied on Sat, 2010/05/01 - 7:20am

In the project I am currently working on, we need to process really huge XML files. But we are using StAX (with XmlBeans), which is really built for the purpose of streaming, and I find this solution smarter than this one... Take a look at this article: http://www.devx.com/xml/Article/34037 Using these techniques you can keep binding while using streaming :) I would be interested in your opinion on the technique we use ...

Andy Leung replied on Mon, 2010/05/03 - 7:18am

@Alois I think the reason is that this article assumes the hierarchy under "Product" is small; at least that's my impression.

I worked on a project that needed to read 4GB of XML. I didn't use this method but a similar one. I even made it multi-threaded because it was read-only. I planned to make it use NIO so that multi-threading would be more efficient, but I left that company around the time I learned about NIO.

Peter Karussell replied on Mon, 2010/05/03 - 8:58am

> But we are using StAX (with XmlBeans)

I am choosing the tools depending on the requirements ;-)

The parsing was implemented in 10 minutes and no external libs are necessary.

BTW: I prefer the simple library over XmlBeans. Or what are the advantages of XmlBeans?

> the hierarchy under "Product" is small, at least that's what I feel.

exactly

Andrew Crawford replied on Tue, 2010/05/04 - 9:10pm

I've used Apache Digester before for 1-2GB files, and it seems to work fine without me having to write my own custom SAX Handler. With digester, you can set up your own rules using XPath and get it to populate / create objects when they are encountered. Unless I'm missing something, I can't really see the benefits of what you are doing over using Digester.
