Lucene's In-memory Terms Dictionary, Thanks to Google Summer of Code
Han Jiang's earlier Google Summer of Code project was a
big success: he created a new (now default) postings format
for substantially faster searches, along with smaller indices.
This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.
In fact, he created two new terms dictionary implementations. The first,
FSTTermsWriter/Reader, holds all terms and metadata
in a single in-memory FST, while the second,
FSTOrdTermsWriter/Reader, does the same but also supports
retrieving the ordinal for a term and
looking up a term given its ordinal (ord).
The second one also uses this ordinal
internally so that the FST is more compact, while all metadata is
stored outside of the FST, referenced by ordinal.
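To make the ordinal idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class `OrdTermsSketch` and its methods are hypothetical, a sorted array stands in for the FST, and a single `long` per term stands in for the metadata. It only illustrates the two lookups (term to ordinal, ordinal to term) and metadata kept in a side array addressed by ordinal.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not Lucene's implementation.
final class OrdTermsSketch {
    private final String[] terms;   // sorted terms; stand-in for the FST (term -> ord)
    private final long[] docFreqs;  // metadata stored outside the "FST", indexed by ord

    OrdTermsSketch(Map<String, Long> termToDocFreq) {
        // Sort the terms so that each term's position is its ordinal.
        // Note: String order is used here; Lucene sorts terms by their bytes.
        TreeMap<String, Long> sorted = new TreeMap<>(termToDocFreq);
        terms = new String[sorted.size()];
        docFreqs = new long[sorted.size()];
        int ord = 0;
        for (Map.Entry<String, Long> e : sorted.entrySet()) {
            terms[ord] = e.getKey();
            docFreqs[ord] = e.getValue();
            ord++;
        }
    }

    /** Returns the ordinal of the term, or -1 if the term is absent. */
    long ord(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? i : -1;
    }

    /** Looks up a term given its ordinal. */
    String termAt(long ord) {
        return terms[(int) ord];
    }

    /** Fetches metadata by ordinal, without touching the term itself. */
    long docFreq(long ord) {
        return docFreqs[(int) ord];
    }
}
```

Because the metadata lives in a separate array, the term structure only has to map each term to a small integer, which is the intuition behind the more compact FST.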
Like the default
BlockTree terms dictionary, these new
terms dictionaries accept any
PostingsBaseFormat so you
can separately plug in whichever format you want to encode/decode the
postings.
Han also improved the
PostingsBaseFormat API so that
there is now a cleaner separation of how terms and their metadata are
encoded, as opposed to how postings are encoded:
PostingsReaderBase.decodeTerm and its writer-side counterpart now handle
encoding and decoding any term metadata required by the postings format,
abstracting away how the long and byte values were persisted by the terms
dictionary. Previously, this line was annoyingly blurry.
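The separation can be pictured with a small, self-contained sketch. Everything here is hypothetical (the class `TermMetadataSketch`, the `Encoded` and `TermState` types, the field names, and the method signatures); it does not reproduce Lucene's actual API. The point is only that the postings side converts its own per-term state to and from opaque longs and bytes, while the terms dictionary stores those values without interpreting them.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Lucene's actual classes or signatures.
final class TermMetadataSketch {

    /** What a terms dictionary would store per term: opaque longs plus opaque bytes. */
    static final class Encoded {
        final long[] longs;
        final byte[] bytes;

        Encoded(long[] longs, byte[] bytes) {
            this.longs = longs;
            this.bytes = bytes;
        }
    }

    /** What the postings format cares about; the field names are made up. */
    static final class TermState {
        final long postingsStart;
        final int flags;

        TermState(long postingsStart, int flags) {
            this.postingsStart = postingsStart;
            this.flags = flags;
        }
    }

    /** Postings side: turn term state into opaque values for the dictionary to persist. */
    static Encoded encodeTerm(TermState state) {
        long[] longs = { state.postingsStart };
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(state.flags).array();
        return new Encoded(longs, bytes);
    }

    /** Postings side: rebuild term state from whatever the dictionary handed back. */
    static TermState decodeTerm(Encoded encoded) {
        long postingsStart = encoded.longs[0];
        int flags = ByteBuffer.wrap(encoded.bytes).getInt();
        return new TermState(postingsStart, flags);
    }
}
```

With this split, a terms dictionary is free to keep the `Encoded` values in an FST, in a side array, or on disk, and the postings format never needs to know which.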
Unfortunately, while primary key lookups are substantially faster, other queries such as
WildcardQuery are slower; the associated Lucene issue has
the details. Fortunately, because the postings format can be chosen per field,
you are free to pick and choose which fields (e.g., your "id" field)
should use the new terms dictionary.
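As a rough illustration of that per-field choice, the sketch below models the decision in plain Java. It is not Lucene's configuration API: `PerFieldChoiceSketch`, the `TermsDict` enum, and both method names are hypothetical. The idea is simply that the default stays in place and only the fields you name are routed to the in-memory dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-field choice; not Lucene's codec API.
final class PerFieldChoiceSketch {
    enum TermsDict { BLOCK_TREE, FST_IN_MEMORY }

    private final Map<String, TermsDict> overrides = new HashMap<>();

    /** Registers an explicit choice for one field and returns this for chaining. */
    PerFieldChoiceSketch use(String field, TermsDict dict) {
        overrides.put(field, dict);
        return this;
    }

    /** Fields without an explicit choice fall back to the default dictionary. */
    TermsDict forField(String field) {
        return overrides.getOrDefault(field, TermsDict.BLOCK_TREE);
    }
}
```

For example, `new PerFieldChoiceSketch().use("id", PerFieldChoiceSketch.TermsDict.FST_IN_MEMORY)` would send only the "id" field to the in-memory dictionary, leaving fields that serve wildcard-style queries on the default.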
For now, this feature is trunk-only (eventually Lucene 5.0).
Thank you, Han, and thank you, Google!
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)