Michael loves building software; he's been building search engines for
more than a decade, and has been working on Lucene as a committer, PMC
member and Apache member, for the past few years. He's co-author of
the recently published Lucene in Action, 2nd edition. In his spare
time Michael enjoys building his own computers, writing software to
control his house (mostly in Python), encoding videos and tinkering
with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile
Progress! The patch on LUCENE-2792 is now committed to Lucene's trunk (eventually 4.0), so that we now use an FST to hold all terms in RAM for the SimpleText codec.
was a great step forward, and it added the raw low-level infrastructure
for building and using FSTs. But, it's really just a toy usage, since
the SimpleText codec is not for production use.
I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene! With LUCENE-2843 (now committed), we use an FST to hold the terms index in RAM.
operations in Lucene require looking up metadata for a requested term.
This information includes the number of documents containing the term,
file pointers into postings files that actually store the docIDs and
positions, etc. Because there could be many unique terms, all this term
metadata resides on-disk in the terms dictionary file (_X.tis).
up a term is then a two-step process. First, we consult the terms
index, which resides entirely in RAM, to map the requested term to a
file pointer in the terms dictionary file. Second, we scan the terms in
the on-disk terms dictionary file starting from that file pointer, to
look for an exact match.
Certain queries, such as TermQuery, need only one exact term lookup. Others, such as FuzzyQuery or the new, very cool automaton spellchecker, which does not require a separate spellchecker index, perform many term lookups (potentially millions).
Before LUCENE-2843, in trunk, the terms index used packed byte and int arrays. In fact, this was already a huge improvement over the terms index in 3.0,
but is still more wasteful than an FST since each term was stored in
fully expanded form, while the FST shares common prefixes and suffixes.
Getting this working also required adding separate seekFloor and seekCeil methods to the FSTEnum classes (like a SortedMap<T>).
To test this, I indexed the first 10 million 1KB documents derived from Wikipedia's English database download.
The resulting RAM required for the FST was ~38% - 52% smaller (larger
segments see more gains, as the FST "scales up" well). Not only is the
RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.
is very much win/win, so we've made this terms index the default for
all core codecs (the terms dictionary and terms index are pluggable, so
its easy for other codecs to use this as well).
There are many
other ways we can use FSTs in Lucene, and we've only just scratched the
surface here. In fact, FSTs offers such a sizable RAM reduction that I
think for many, but not all, apps it'd be realistic to avoid the
two-step term lookup process entirely and simply hold the entire terms
dictionary in RAM. This should make term lookup intensive queries
potentially much faster, though we'd likely have to rework them to use
different algorithms optimized for iterating directly through the terms
as an FST.