Michael loves building software; he's been building search engines for
more than a decade, and has been working on Lucene as a committer, PMC
member and Apache member, for the past few years. He's co-author of
the recently published Lucene in Action, 2nd edition. In his spare
time Michael enjoys building his own computers, writing software to
control his house (mostly in Python), encoding videos and tinkering
with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile
Primary key lookups are 2.8X faster with MemoryCodec
A few weeks ago I committed the new MemoryCodec to Lucene's trunk (to be 4.0). This codec indexes all terms and postings into a compact finite-state transducer (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.
your application needs fast primary-key lookups, and you can afford the
required additional memory, this codec might be a good match for the id field. To test this, I switched Lucene's nightly benchmark to use MemoryCodec (just for its id field), and performance jumped from around 179 K to 509 K lookups per second:
is the performance for a single thread, and should scale up linearly if
you use multiple threads. Each lookup resolves 4,000 keys in order at
once from the id field, performing the lookups segment by segment for best performance (see the source code). The index has 27.6 M docs across multiple segments.
course, there is added memory required, specifically 188 MB for this
index, which works out to 7.1 bytes per document on average.
There are two sources of MemoryCodec's
gains. First, the obvious one: since everything is in memory, you
never wait for an I/O seek operation, as long as you ensure the sneaky
OS never swaps out your process memory.
Second, I separately added a new seekExact API to TermsEnum,
enabling codecs to save CPU if the caller does not need to know the
following term when the target term doesn't exist, as is the case here.
MemoryCodec has an optimized implementation for seekExact (and so does the cool SimpleTextCodec!). Eventually other codecs should as well, by using the block tree terms index, but we're not there yet.
The id field in the nightly benchmark omits term freq and positions, however MemoryCodec
is fully general: you can use it for any field (not just primary-key),
storing positions, payloads, etc. Also, its values are zero-padded
sequential integers (00000001, 00000002, 00000003, etc.), which is
likely important for performance as it allows maximal sharing in the
FST. I haven't tested but I suspect had I used something more random,
such as GUIDs, memory usage would be higher and lookup performance worse as each segment's FST would be less dense (share less).
course, Lucene is not a database, and you normally use it for its fast
search performance, not primary-key lookups. The one common search use
case where you do require primary-key lookups is during indexing, when
deleting or updating documents by an id field. Near-realtime search with updates or deletions relies on this, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:
NRT latencey dropped from around 52 milliseconds to 43 milliseconds, a
17% improvement. This is "only" 17% because opening a new reader must
also do other things like flush the indexed documents as a new segment.
Perhaps more importantly, the variance also dropped substantially, which is expected because with MemoryCodec and NRTCachingDirectory, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).
One limitation of MemoryCodec is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't. LUCENE-3069,
still to be done (any volunteers?), aims to fix this, by enabling you
to separately choose whether terms and/or postings data should be in
I suspect an even more specialized codec, for example one
that requires the field values to be compact integers, and also
requires that the values are unique (only supports primary-key fields),
could do even better than MemoryCodec by storing the mapping in
global (across all segments) parallel arrays. Such a codec would no
longer be general; it'd only work for primary-key fields whose values
are compact integers. But it'd have faster lookups than MemoryCodec
and should use less memory per document. This codec could simply wrap
any other codec, i.e. it would create the arrays on reader
initialization, and delegate persisting the postings into the index to
the wrapped codec.