Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile

Lucene's FuzzyQuery is 100 times faster in 4.0

  • submit to reddit
There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.

The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do is code it up! It's just software after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.

The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).

We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.

Not really understanding the Python code, and also neither the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).

The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using a binary (only characters 0 and 1) terms.

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases, too; for example it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use case, in more detail.

Separately, we had an impedance mismatch: these automatons speak full unicode (UTF32) characters, yet Lucene's terms are stored in UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 Levenshtein automaton, which can be directly intersected against the terms in the index.

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]
Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Mitch Pronschinske replied on Mon, 2011/04/18 - 12:35pm

Looks like the 1.7 release of LucidWorks Enterprise has a certified, tested, packaged, and supported distribution of the Solr/Lucene 4.x trunk.  It's free to download and test so it could be a big help in setting up a stable distribution of 4.0 and trying out FuzzyQuery.

Shoaib Almas replied on Sat, 2012/08/25 - 5:53am

Nice post! Interesting to see this story put to print.

I'd only make one correction to the history line:

We did finally loosely crack the paper during those late night sessions - allowing Robert to do a test impl for n=1. There was still some leaps to be made to generalize those steps for n > 1.

Thankfully you guys came up with the brilliant idea of using Moman for java code generation instead...

Java Forum

Will Martin replied on Wed, 2014/10/08 - 10:03pm

 Mike: " seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton". Seek and Scan an index or indices are technical terms. Can you explain further? How do you avoid touching every term in the index w/o building yet another index. We "skipped" regions in the CPL indexes because the data structure supported metadata about the page(s) in working memory. Does the term space in-memory representation allow the automaton to calculate compute distance to the next possible match?

On the other hand, in CPL the dictionary of terms (the inverted index index) could be easily seek and scanned without touching every entry due to its lexical properties.

I guess the point I'm getting at is whether the Lucene 4.0 implementation terminates each term comparison earlier or actually does not even look at some terms it used to.

btw: FAST ESP had a python production implementation of the same in 2010. The functionality was loved when useful, but I never had a chance to benchmark the performance.

Thanks, Will

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.