Using Lucene and Cascalog for Fast Text Processing at Scale

11.09.2011

This post explains text processing and analytics techniques used at the startup Yieldbot.  Their technology uses open source tools including Cascalog, Lucene, Hadoop, and Clojure's Java Interop.  The following post was authored by Soren Macbeth, a Data Scientist at Yieldbot.

Here at Yieldbot we do a lot of text processing of analytics data. To accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside the Clojure REPL, which lets you iterate on processing workflows very quickly. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer without having to implement any domain-specific APIs or interfaces for custom processing functions. When combined with Clojure's awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, and Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on GitHub.

Our goal is to tokenize a string of text. This is almost always the first step in doing any sort of text processing, so it's a good place to start. For our purposes we'll define a token broadly as a basic unit of language that we'd like to analyze; typically a token is a word. There are many different methods for doing tokenization. Lucene contains many different tokenization routines which I won't cover in any detail here, but you can read the docs to learn more. We'll be using Lucene's Standard Analyzer, which is a good basic tokenizer. It will lowercase all input, remove a basic list of stop words, and is pretty smart about handling punctuation and the like.
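
To make those three steps concrete, here is a rough sketch in plain Clojure. It is not how the Standard Analyzer works internally: `naive-tokenize` and the tiny `stop-words` set are hypothetical names made up for illustration, and the regex split is far cruder than Lucene's handling of punctuation.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, tiny stop-word list for illustration only; Lucene's
;; StandardAnalyzer ships its own default set.
(def stop-words #{"a" "an" "and" "the" "of" "to" "is"})

(defn naive-tokenize
  "Lowercase the text, split on runs of characters that are not
  ASCII letters or digits, and drop empty strings and stop words."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove str/blank?)
       (remove stop-words)))

(naive-tokenize "The quick brown fox, and the lazy dog!")
;; => ("quick" "brown" "fox" "lazy" "dog")
```

A real analyzer deals with many cases this sketch ignores (non-ASCII text, email addresses, apostrophes, and so on), which is the reason to lean on Lucene rather than a regex.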

First, let's mock up our Cascalog query. Our inputs are going to be 1-tuples of a string that we would like to break into tokens.

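;; Read one string per line from in-path, tokenize each line, and write
;; the (line, token) pairs to out-path, replacing any existing output.
(defn tokenize-strings [in-path out-path]
  (let [src (hfs-textline in-path)]
    (?<- (hfs-textline out-path :sinkmode :replace)
         [!line ?token]
         (src !line)
         (tokenize-string !line :> ?token)
         (:distinct false))))   ; keep duplicate tuples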

I won't waste a ton of time explaining Cascalog's syntax, since the wiki and docs are already very good at that. What we're doing here is reading in a text file that contains the strings we'd like to tokenize, one string per line. Each of these strings will be passed into the tokenize-string function, which will emit one 1-tuple for each token generated.
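
If the shape of the output is hard to picture, the following plain-Clojure sketch mimics it without Cascalog or Hadoop. Both `toy-tokenize` (a whitespace split standing in for the real tokenizer) and `line-token-pairs` are hypothetical helpers, not part of the query above.

```clojure
(require '[clojure.string :as str])

;; Stand-in for the real tokenizer: a plain whitespace split.
(defn toy-tokenize [line]
  (str/split line #"\s+"))

(defn line-token-pairs
  "Pair every input line with each token produced from it, mirroring
  the [!line ?token] output fields of the query."
  [lines]
  (for [line  lines
        token (toy-tokenize line)]
    [line token]))

(line-token-pairs ["hello world" "foo"])
;; => (["hello world" "hello"] ["hello world" "world"] ["foo" "foo"])
```

Each input line appears once per token it produced, which is what you should expect to see in the output file.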

Next let's write our tokenize-string function. We'll use a handy feature of Cascalog here called a stateful operation. It looks like this:

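(defmapcatop tokenize-string {:stateful true}
  ;; setup: build the analyzer once per task
  ([] (load-analyzer StandardAnalyzer/STOP_WORDS_SET))
  ;; per tuple: tokenize the text and emit one 1-tuple per token
  ([analyzer text]
     (emit-tokens (tokenize-text analyzer text)))
  ;; cleanup: nothing to release
  ([analyzer] nil))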

The 0-arity version gets called once per task, at the beginning. We'll use this to instantiate the Lucene analyzer that will be doing our tokenization. The 1+n-arity version receives the result of the 0-arity function as its first parameter, plus any other parameters we define. This is where the actual work will happen. The final 1-arity function is used for cleanup.
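
The lifecycle can be simulated in plain Clojure to see why it matters. This is only a sketch of the calling convention described above, not Cascalog's implementation; `run-stateful`, `toy-op`, and `init-count` are hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn run-stateful
  "Simulate the lifecycle: call the 0-arity once to build state, call the
  2-arity once per input with that state, then call the 1-arity to clean up."
  [op inputs]
  (let [state   (op)
        ;; doall forces all the work to happen before cleanup runs
        results (doall (mapcat #(op state %) inputs))]
    (op state)
    results))

;; Counts how many times the setup arity runs.
(def init-count (atom 0))

(defn toy-op
  ;; setup, called once
  ([] (swap! init-count inc) {:sep #"\s+"})
  ;; per input: split the text and wrap each piece in a 1-tuple
  ([state text] (map vector (str/split text (:sep state))))
  ;; cleanup: nothing to release in this toy
  ([state] nil))

(run-stateful toy-op ["a b" "c"])
;; => (["a"] ["b"] ["c"])
@init-count
;; => 1
```

The setup arity runs once no matter how many inputs flow through, so an expensive object such as an analyzer is built once per task rather than once per tuple.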

Next, we'll create the rest of the utility functions we need to load the Lucene analyzer, get the tokens and emit them back out.

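;; Lucene classes used below. Assumes Lucene 3.x, where TermAttribute
;; still exists.
(import '(org.apache.lucene.analysis TokenStream)
        '(org.apache.lucene.analysis.standard StandardAnalyzer)
        '(org.apache.lucene.analysis.tokenattributes TermAttribute)
        '(org.apache.lucene.util Version))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att) (tokenizer-seq tokenizer term-att)))))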

(defn load-analyzer [^java.util.Set stopwords]
  (StandardAnalyzer. Version/LUCENE_CURRENT stopwords))

(defn tokenize-text
  "Apply a lucene tokenizer to cleaned text content as a lazy-seq"
  [^StandardAnalyzer analyzer page-text]
  (let [reader (java.io.StringReader. page-text)
        tokenizer (.tokenStream analyzer nil reader)
        term-att (.addAttribute tokenizer TermAttribute)]
    (tokenizer-seq tokenizer term-att)))

(defn emit-tokens
  "Wrap each token in its own 1-tuple (n-grams with n = 1)"
  [tokens-seq]
  (partition 1 1 tokens-seq))

We lean heavily on Clojure's awesome Java Interop here, letting Lucene's Java API do the heavy lifting. While this example is very simple, you can take this framework and drop in any of the different Lucene analyzers available to do much more advanced work with little change to the Cascalog code.
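
The lazy-seq trick in tokenizer-seq works for any stateful Java object that hands out one item at a time. Here is the same pattern applied to the JDK's `java.util.StringTokenizer`, which needs no Lucene on the classpath; `string-tokenizer-seq` is a hypothetical helper written for this illustration.

```clojure
(import 'java.util.StringTokenizer)

(defn string-tokenizer-seq
  "Wrap a stateful StringTokenizer in a lazy seq, pulling one token
  each time the seq is advanced."
  [^StringTokenizer st]
  (lazy-seq
    (when (.hasMoreTokens st)
      (cons (.nextToken st) (string-tokenizer-seq st)))))

(def tokens
  (string-tokenizer-seq (StringTokenizer. "fast text processing")))

tokens
;; => ("fast" "text" "processing")

;; partition 1 1 wraps each token in its own one-element sequence,
;; which is the shape emit-tokens hands back to Cascalog.
(partition 1 1 tokens)
;; => (("fast") ("text") ("processing"))
```

Because a lazy seq caches what it has already realized, it is safe to walk `tokens` more than once even though the underlying tokenizer can only be consumed a single time.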

By leaning on Lucene, we get battle-hardened, speedy processing without having to write a ton of glue code, thanks to Clojure. Since Cascalog code is Clojure code, we don't have to spend a ton of time switching back and forth between different build and testing environments, and a production deploy is just a `lein uberjar` away.


Source: http://blog.yieldbot.com/using-lucene-and-cascalog-for-fast-text-proce