Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile

Finite State Transducers, Part 2

  • submit to reddit
In my last post, I described a cool incremental algorithm for building an FST from pre-sorted input/output pairs, and how we're folding it into Lucene.

Progress! The patch on LUCENE-2792 is now committed to Lucene's trunk (eventually 4.0), so that we now use an FST to hold all terms in RAM for the SimpleText codec.

This was a great step forward, and it added the raw low-level infrastructure for building and using FSTs. But, it's really just a toy usage, since the SimpleText codec is not for production use.

I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene! With LUCENE-2843 (now committed), we use an FST to hold the terms index in RAM.

Many operations in Lucene require looking up metadata for a requested term. This information includes the number of documents containing the term, file pointers into postings files that actually store the docIDs and positions, etc. Because there could be many unique terms, all this term metadata resides on-disk in the terms dictionary file (_X.tis).

Looking up a term is then a two-step process. First, we consult the terms index, which resides entirely in RAM, to map the requested term to a file pointer in the terms dictionary file. Second, we scan the terms in the on-disk terms dictionary file starting from that file pointer, to look for an exact match.

Certain queries, such as TermQuery, need only one exact term lookup. Others, such as FuzzyQuery or the new, very cool automaton spellchecker, which does not require a separate spellchecker index, perform many term lookups (potentially millions).

Before LUCENE-2843, in trunk, the terms index used packed byte[] and int[] arrays. In fact, this was already a huge improvement over the terms index in 3.0, but is still more wasteful than an FST since each term was stored in fully expanded form, while the FST shares common prefixes and suffixes. Getting this working also required adding separate seekFloor and seekCeil methods to the FSTEnum classes (like a SortedMap<T>).

To test this, I indexed the first 10 million 1KB documents derived from Wikipedia's English database download. The resulting RAM required for the FST was ~38% - 52% smaller (larger segments see more gains, as the FST "scales up" well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.

This is very much win/win, so we've made this terms index the default for all core codecs (the terms dictionary and terms index are pluggable, so its easy for other codecs to use this as well).

There are many other ways we can use FSTs in Lucene, and we've only just scratched the surface here. In fact, FSTs offers such a sizable RAM reduction that I think for many, but not all, apps it'd be realistic to avoid the two-step term lookup process entirely and simply hold the entire terms dictionary in RAM. This should make term lookup intensive queries potentially much faster, though we'd likely have to rework them to use different algorithms optimized for iterating directly through the terms as an FST.
Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)



Kathy John replied on Tue, 2012/02/21 - 1:35pm

It's very interesting. Is there some way to use your FSM implementation without Lucene itself? I'm interested, because I am writing dictionary based lemmatizer for russian language. Stemmers works not so well for russian, because it's very complicated language with very rich flexion model. And so I need some memory efficient data structure which allows me to map char sequences to their ordinal lemma number. I think FST would help me a lot.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.