Michael loves building software; he's been building search engines for
more than a decade, and has been working on Lucene as a committer, PMC
member and Apache member, for the past few years. He's co-author of
the recently published Lucene in Action, 2nd edition. In his spare
time Michael enjoys building his own computers, writing software to
control his house (mostly in Python), encoding videos and tinkering
with all sorts of other things.
There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery
really stands out, not only for its incredible gains but also for
the amazing behind-the-scenes story of how it all came to be.
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (in older releases N was a confusing float between 0.0 and 1.0, which was translated to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k)
and a great many other "close" terms. With max edit distance 2 you can
have up to 2 insertions, deletions or substitutions. The score for
each match is based on the edit distance of that term; so an exact match
is scored highest; edit distance 1, lower; etc.
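To make that concrete, here's a minimal sketch of issuing a fuzzy query both programmatically and through QueryParser. The field name, analyzer choice, and index setup are hypothetical, and exact signatures vary a bit across Lucene versions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class FuzzyQueryDemo {
    public static void main(String[] args) throws Exception {
        // Programmatic form: match terms within 1 edit of "mcandless"
        // in a hypothetical "author" field.
        Query direct = new FuzzyQuery(new Term("author", "mcandless"), 1);

        // QueryParser form: the ~1 suffix means max edit distance 1.
        Query parsed = new QueryParser("author", new StandardAnalyzer())
            .parse("mcandless~1");

        System.out.println(direct);  // e.g. author:mcandless~1
        System.out.println(parsed);
    }
}
```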
Prior to 4.0, FuzzyQuery
took the simple yet horribly costly brute force approach: it visits
every single unique term in the index, computes the edit distance for
it, and accepts the term (and its documents) if the edit distance is low enough.
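To see why that's so costly, here's a rough sketch (not Lucene's actual code) of the brute-force approach: a full dynamic-programming edit distance computed against every unique term, on every query:

```java
import java.util.ArrayList;
import java.util.List;

public class BruteForceFuzzy {
    // Collect every term within maxEdits of the base term by scanning
    // ALL unique terms -- O(numTerms * |base| * |term|) work per query.
    static List<String> fuzzyMatch(Iterable<String> allTerms, String base, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : allTerms) {
            if (levenshtein(base, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```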
The journey begins
The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N.
Doing this, up front, and then intersecting that automaton with the
terms in the index, should give a massive speedup, he reasoned.
At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N
insertions, deletions and substitutions. But, unfortunately, just
building that DFA (let alone intersecting it with the terms in the
index) was too slow.
Fortunately, after some Googling, he discovered a paper
by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr
committers!) detailing an efficient algorithm for building the
Levenshtein Automaton from a given base term and max edit distance. All
he had to do was code it up! It's just software, after all. Somehow, he
roped Mark Miller, another Lucene/Solr committer, into helping him do this.
Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all
sorts of equations, Greek symbols, definitions, propositions, lemmas,
proofs. It uses scary concepts like Subsumption Triangles, along with
beautiful yet still unintelligible diagrams. Really the paper may as
well have been written in Latin.
Much coffee and beer was
consumed, sometimes simultaneously. Many hours were spent on IRC,
staying up all night, with Mark and Robert carrying on long
conversations, which none of the rest of us could understand, trying
desperately to decode the paper and turn it into Java code. Weeks went
by like this, and they had actually made some good initial progress,
managing to loosely crack the paper to the point where they had a test
implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.
Finally, a breakthrough! Robert found, after even more Googling, an
existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Philippe Barrette-LaPierre,
had somehow, incredibly, magically, quietly, implemented the algorithm
from this paper. And this was apparently a random side project for him,
unrelated to his day job. So now we knew it was possible (and we all
have deep admiration for Jean-Philippe!).
We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language!), and nearly as hairy as the paper itself. Nevertheless, we pushed on.
Without really understanding either the Python code or the paper, we
desperately tried to write our own Python code to tap into the various
functions embedded in Moman's code, to auto-generate Java code
containing the necessary tables for each max edit distance case (N=1, N=2,
etc.). We had to guess what each Python function did, by its name,
trying to roughly match this up to the spooky terminology in the paper.
The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).
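In the end this machinery is exposed through Lucene's LevenshteinAutomata class, which consumes those generated parametric tables. A rough sketch of building and testing the per-query automaton (exact signatures vary across Lucene versions):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

public class LevAutomatonDemo {
    public static void main(String[] args) {
        // Build a DFA accepting all terms within 2 edits of "mcandless",
        // driven by the auto-generated parametric tables.
        Automaton lev2 = new LevenshteinAutomata("mcandless", false).toAutomaton(2);

        // CharacterRunAutomaton compiles the DFA for fast acceptance checks.
        CharacterRunAutomaton run = new CharacterRunAutomaton(lev2);
        System.out.println(run.run("mccandless")); // true: one insertion
        System.out.println(run.run("candle"));     // false: three edits away
    }
}
```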
The last bug...
Now, what a crazy position we were in. We wrote our own scary Python
code, tapping into various functions in the Moman package, to
auto-generate unreadable Java code with big tables of numbers, which is
then used to generate Levenshtein automata from the base term and N.
We went through many iterations with this crazy chain of Python and
Java code that we barely understood, slowly iterating to get the bugs
out. After fixing many problems, we still had one persistent bug
which we just couldn't understand, let alone fix. We struggled for
several days, assuming the bug was in our crazy Python/Java chain.
Finally, we considered the possibility that the bug was in Moman, and
indeed Robert managed to reduce the problem to a tiny Python-only case
showing where Moman failed to match the right terms. Robert sent this
example to Jean-Philippe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!
Fortunately, while this fast FuzzyQuery
was unbelievably hairy to implement, testing it well is relatively easy
since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery.
The tests are exhaustive in that they cover all structurally different
cases possible in the Levenshtein construction, using binary terms
(only the characters 0 and 1).
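Here's an illustrative version of that cross-check (not the actual test code): enumerate every short binary string and verify the automaton agrees with a brute-force edit distance:

```java
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

public class CrossCheckLev {
    public static void main(String[] args) {
        String base = "0101";
        int maxEdits = 2;
        CharacterRunAutomaton dfa = new CharacterRunAutomaton(
            new LevenshteinAutomata(base, false).toAutomaton(maxEdits));

        // Try every binary string up to length 8 and compare the DFA's
        // answer against a brute-force Levenshtein computation.
        for (int len = 0; len <= 8; len++) {
            for (int bits = 0; bits < (1 << len); bits++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < len; i++) sb.append((bits >> i) & 1);
                String term = sb.toString();
                boolean expected = levenshtein(base, term) <= maxEdits;
                if (dfa.run(term) != expected) {
                    throw new AssertionError("mismatch on " + term);
                }
            }
        }
        System.out.println("automaton agrees with brute force");
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()];
    }
}
```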
Beyond just solving
this nearly impossible task of efficiently compiling a term to a
Levenshtein Automaton, we had many other parts to fill in. For example,
Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics
automaton package, to enable fast intersection of an automaton against
all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases; for example, it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use cases, in more detail.
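For instance, wildcard and regexp queries are now just thin layers over this shared automaton machinery; a minimal sketch (the field name is hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.WildcardQuery;

public class AutomatonQueries {
    public static void main(String[] args) {
        // Both of these compile down to an automaton that is intersected
        // with the index's terms, rather than testing terms one by one.
        Query regexp = new RegexpQuery(new Term("title", "luc.*e"));
        Query wildcard = new WildcardQuery(new Term("title", "luc*ne"));
        System.out.println(regexp + " | " + wildcard);
    }
}
```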
Separately, we had an impedance mismatch: these automatons speak full unicode (UTF32) characters, yet Lucene's terms are stored as UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 automaton, which can then be intersected directly against the terms in the index.
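In current Lucene sources that converter appears as the UTF32ToUTF8 class; roughly, the conversion step looks like this (a sketch only; signatures may differ by version):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.apache.lucene.util.automaton.UTF32ToUTF8;

public class Utf8Conversion {
    public static void main(String[] args) {
        // The Levenshtein DFA is built over UTF32 codepoints...
        Automaton utf32 = new LevenshteinAutomata("mcandless", false).toAutomaton(1);
        // ...then converted to an equivalent byte-based (UTF8) automaton,
        // which can run directly against Lucene's UTF8 term bytes.
        Automaton utf8 = new UTF32ToUTF8().convert(utf32);
        System.out.println("built UTF8 automaton with " + utf8.getNumStates() + " states");
    }
}
```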
So, today, when you run a FuzzyQuery
in 4.0, it efficiently seeks and scans only those regions of the term
space which may have matches, guided by the Levenshtein automaton.
This, coupled with ongoing performance improvements to seeking and
scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.
Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.
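That work shipped as Lucene's DirectSpellChecker; here's a minimal sketch of using it against an existing index (the index path and field name are hypothetical, and the Directory-opening API varies by version):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.store.FSDirectory;

public class RespellDemo {
    public static void main(String[] args) throws Exception {
        // Suggestions come straight from the main index's terms --
        // no separate spellchecker index to build or keep in sync.
        try (IndexReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            DirectSpellChecker spellChecker = new DirectSpellChecker();
            SuggestWord[] suggestions =
                spellChecker.suggestSimilar(new Term("body", "lucnee"), 5, reader);
            for (SuggestWord s : suggestions) {
                System.out.println(s.string + " (freq=" + s.freq + ")");
            }
        }
    }
}
```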
This whole exciting experience is a great example of why open-source
development works so well. Here we have diverse committers from
Lucene/Solr, bringing together their various unusual strengths
(automatons, Unicode, Python, etc.) to bear on an insanely hard
challenge, leveraging other potent open-source packages including Moman
and Brics, iterating with the authors of these packages to resolve bugs.
No single person involved in this really understands all of the parts;
it's truly a team effort.
And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!
[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]