

A Universal Document Scraper in Scala

09.21.2013

As part of a project I’m working on, I needed to get documents from state institutions. Instead of writing code specific to each site, I decided to try creating a “universal” document scraper. It can be found as a separate module within the main project: https://github.com/Glamdring/state-alerts/. The project is written in Scala and can be used in any JVM project (provided you add the Scala library as a dependency). It is meant for scraping documents rather than arbitrary data. It could probably be extended to do that, but for now I’d like it to stay oriented towards state documents and open data, rather than become a tool for commercial scraping (which is often frowned upon).

It is now in a more or less stable form (I’ve already deployed the application and it works properly), so I’ll share a short description of the functionality. The goal is to be able to specify scraping purely through configuration. The class used to configure an individual scraping instance is ExtractorDescriptor. There you specify a number of things (a hypothetical configuration sketch follows the list):

  • The target URL, HTTP method and body parameters (in case of POST). The URL can contain a {x} placeholder, which is used for paging.
  • The type of document (PDF, DOC or HTML) and the type of scraping workflow, i.e. how the document is reached on the target page. There are four options, depending on whether there’s a separate details page, whether there’s only a table, and where the link to the document is located.
  • XPath expressions for the elements that contain the metadata and the links to the documents. Different expressions are used depending on where the information is located: in a table or on a separate details page.
  • The date format for the document’s date. A regex can be used if the date cannot be precisely located via XPath.
  • Simple “heuristics”. For example, if you know the URL structure of the documents you are looking for, there’s no need to locate them via XPath.
  • Other configurations, such as whether JavaScript is required, whether scraping should fail on error, and so on.
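
To make these options concrete, here is a minimal, hypothetical sketch of what such a configuration could look like. The case class and field names below are stand-ins invented for illustration; the real ExtractorDescriptor fields are defined in the state-alerts source and differ from this.

```scala
// Hypothetical stand-in for the real ExtractorDescriptor, only meant to make
// the configuration options above concrete. The actual field names in the
// state-alerts module are different and are defined in its source.
case class ScrapeConfig(
  targetUrl: String,                            // may contain the {x} paging placeholder
  httpMethod: String,                           // "GET" or "POST"
  bodyParams: Map[String, String] = Map.empty,  // only relevant for POST
  documentType: String,                         // "PDF", "DOC" or "HTML"
  rowXPath: String,                             // entries (e.g. table rows) holding document metadata
  documentLinkXPath: String,                    // link to the document, relative to an entry
  dateXPath: String,                            // where the document date is located
  dateFormat: String,                           // e.g. "dd.MM.yyyy"
  javascriptRequired: Boolean = false,
  failOnError: Boolean = false
)

// Example configuration for an imaginary ministry site that lists PDFs in a table
val config = ScrapeConfig(
  targetUrl = "http://example-ministry.gov/decisions?page={x}",
  httpMethod = "GET",
  documentType = "PDF",
  rowXPath = "//table[@id='documents']//tr",
  documentLinkXPath = ".//a[contains(@href, '.pdf')]/@href",
  dateXPath = ".//td[3]/text()",
  dateFormat = "dd.MM.yyyy"
)
```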

When you have an ExtractorDescriptor instance ready (Java applications can use the provided builder to create one), you create a new Extractor(descriptor) and then, usually from a scheduled job, call extractor.extractDocuments(since).

The result is a list of documents (there are two methods: one returns a Scala list and one returns a Java list).
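
Put together, the call sequence looks roughly like the sketch below. The exact type of the since argument and of the returned documents is defined in the project source, so treat this as illustrative rather than as the verified API.

```scala
// Sketch of the call sequence described above, assuming the state-alerts module
// is on the classpath. The date type of "since" is an assumption; check the
// actual signature in the project source.
import java.util.{Calendar, Date}

val descriptor: ExtractorDescriptor = ???   // built as discussed above, or via the Java builder

val extractor = new Extractor(descriptor)

// typically invoked from a scheduled job: fetch everything newer than a cutoff
val calendar = Calendar.getInstance()
calendar.add(Calendar.DAY_OF_MONTH, -1)
val since: Date = calendar.getTime

val documents = extractor.extractDocuments(since)   // the Scala-list variant
documents.foreach(println)
```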

The library depends on HtmlUnit, NekoHTML, Scala, xml-apis and a few others, all visible in the POM. It doesn’t support multiple parsers, and it doesn’t handle distributed execution of scraping tasks; you have to take care of that yourself. No JAR release or Maven dependency is published yet, so if you need it you have to check out and build the project. I hope it is useful, though; if not as code, then at least as an approach to getting data from web pages programmatically.

See more at: http://techblog.bozho.net/?p=1215
Published at DZone with permission of Bozhidar Bozhanov, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)