Geertjan is a DZone Zone Leader and has posted 468 posts at DZone.

How do you parse HTML in Java?

The Open Source HTML Parsers in Java page is useful in listing the HTML parsers that are out there. But it doesn't give much of a clue about which are the "best" in a given situation. In other words, how should one decide which HTML parser to use? And, doesn't the proliferation of HTML parsers out there imply that there is something wrong with the JDK's own HTML parser, javax.swing.text.html.HTMLEditorKit.Parser?

All things being equal, shouldn't one prefer to use a utility provided by the JDK over one provided by a third party library? (For this reason, I'm assuming that all things are not equal in this case.) I've been parsing HTML using the JDK's HTML parser, based on the approach described in Parsing HTML with Swing, although that's an article written in 2003, so it may be dated. The author of that article points to this weakness of the Swing HTML parser: "The biggest downside to this HTML parser is that it is not thread safe (thread safety has always been a problem with Swing components). This HTML processor is no different. I have used the Swing parser in heavily threaded environments, and it has resulted in a crash—eventually. If you want to use this HTML processor in a heavily threaded environment, you need to take steps to ensure that only one thread uses it at a time."
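For readers who haven't used it, here is a minimal sketch of the callback style of the JDK parser, assuming the goal is simply to collect link targets. The class name SwingLinkExtractor, the method extractLinks and the shared PARSER_LOCK are made up for illustration; the lock is just one simple way to follow the quoted advice that only one thread should use the parser at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK or of the article referred to above.
class SwingLinkExtractor {

    // Assumption: one global lock, so that only one thread parses at a time.
    private static final Object PARSER_LOCK = new Object();

    static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();

        // The parser reports each tag to this callback as it encounters it.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        synchronized (PARSER_LOCK) {
            try (Reader reader = new StringReader(html)) {
                // 'true' tells the parser to ignore charset directives in the markup.
                new ParserDelegator().parse(reader, callback, true);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: unclosed <p> and <a>, no closing </html>.
        String html = "<html><body><p>Unclosed paragraph "
                + "<a href=\"http://example.com/a\">A</a><a href='/b'>B</body>";
        System.out.println(extractLinks(html));
    }
}
```

A single lock serializes all parsing, which is fine for occasional use but would become a bottleneck under heavy load. That trade-off may be one reason people reach for third-party parsers.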

Is that the only weakness here? (By the way, on the positive side, the author writes: "I have used this parser with a number of programs that I have written, and I have found it to be very useful. It is particularly helpful for handling improperly formatted HTML, which can trip up some HTML parsers.") I guess the other HTML parsers offer additional features, such as transformation on top of parsing. They probably also allow walking the DOM, rather than inspecting tags one at a time through callbacks as the Swing HTML parser does. I have used JTidy before, but didn't find that its benefits outweighed the overhead of depending on a third-party library.
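For contrast with the tag-by-tag style, the JDK can also load HTML into an HTMLDocument, whose element tree can be walked with ElementIterator. The following is a rough sketch only; the class name SwingTreeWalk and the method elementNames are hypothetical. Note that this is Swing's own text-document model rather than a W3C DOM: inline tags such as <a> end up as attributes on "content" leaves instead of being elements of their own, so it is not a drop-in substitute for what the third-party parsers offer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for illustration only.
class SwingTreeWalk {

    static List<String> elementNames(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoid a ChangedCharSetException if the markup declares a charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        // Depth-first walk over the document's element tree.
        List<String> names = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.next(); e != null; e = it.next()) {
            names.add(e.getName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<html><body><p>Hello</p></body></html>"));
    }
}
```

The same threading caveat presumably applies here, since the document is filled in by the same Swing parser.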

Anyone care to share their experiences with these utilities?

Published at DZone with permission of its author, Geertjan Wielenga.


Ivan Lazarte replied on Wed, 2008/07/02 - 12:56pm

HtmlCleaner from SourceForge is great.

In one command you can parse a potentially invalid HTML document into a valid HTML or XML document. It has a bunch of configuration options that you would need, such as the charset. Highly recommended.

It forms the basis for WebHarvest, which is also a great tool for smaller crawl/parse operations.


Dan Cristoff replied on Tue, 2011/03/08 - 6:13pm

Yes, it's very easy to parse invalid docs into valid ones, and it is easy to use and powerful.
