Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile

Lucene performance with the PForDelta codec

04.22.2011
| 16075 views |
  • submit to reddit
Today, to encode the postings (docs, freqs, positions) in the index, Lucene uses a variable byte format where each integer is individually encoded as 1-5 bytes.

While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily predict the branch outcome.

To reduce the number of branches per int decode, you must switch to decoding groups of ints at once. This paper describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16). Chapter 6 (Index Compression) of Information Retrieval: Implementing and Evaluating Search Engines goes into great detail on these and others, including an addenda showing results for Google's Group VarInt encoder.

The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders. It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts as well as the index (starting int) into the block.

There is already a low-level initial implementation Frame-of-reference (FOR) and Patched frame-of-reference (PFOR), on LUCENE-1410, as well as Simple9/16 on LUCENE-2189, thanks to Paul Elschot and Renaud Delbru.

FOR is a simple encoding: it takes each fixed block of ints, ands stores them all as packed ints, where each value gets N bits, set by the maximum int in the block. So say our block size is 8, and we have these ints:

1 7 3 5 6 2 2 5

we'd only need 3 bits per value. But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits. For example if we have these ints:

1 7 3 5 293 2 2 5

then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each. How much this hurts in practice remains to be seen.

PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode. So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time. The above paper has the details.

Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they [typically] must decode the full block. This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field. Pulsing codec is a better fit for such fields.

During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on LUCENE-1410); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing). I also made some more minor optimizations to the FOR/PFOR implementation. I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.

I indexed the first 10 million Wikipedia ~1KB sized docs. The resulting index sizes were:

CodecSize% Change
Standard3.50 GB 
FOR3.96 GB13.3% bigger
PFOR3.87 GB11.1% bigger

The size increase of both FOR and PFOR are fairly low. PFOR is smaller than FOR, as expected, but it's surprising that it's only a little bit lower. Though, this is a function of where you draw the line for the exception cases and will be corpus dependent. Still, I'm encouraged by how little additional space FOR requires over the Standard codec.

I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads. Here are the results for FOR:

QueryQPS StandardQPS FOR% Change
united~0.65.595.08-9.0%
"united states"11.4611.22-2.1%
united~0.718.3319.234.9%
united states15.5318.6620.1%
uni*d34.2543.3726.6%
unit*31.2741.0731.3%
states55.7275.8236.1%
+united +states15.1721.4341.2%

and for PFOR:

QueryQPS StandardQPS PFOR% Change
united~0.65.595.02-10.1%
"united states"11.4610.88-5.1%
united~0.718.3319.003.7%
united states15.5318.1116.6%
uni*d34.2543.7127.6%
states55.7271.3928.1%
unit*31.2741.4232.5%
+united +states15.1721.2540.1%

They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions. The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance. The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching. The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. Likely a hybrid codec for Lucene, using FOR or FFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.

I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:

QueryQPS StandardQPS FOR% Change
united~0.66.005.28-12.1%
united~0.718.4820.5311.1%
"united states"10.4711.7011.8%
united states15.2120.2533.2%
uni*d33.1647.5943.5%
unit*29.7145.1451.9%
+united +states15.1423.6556.2%
states52.3088.6669.5%

The results are generally even better for two reasons. First, for some reason the Standard codec slows down a bit with MMapDirectory. Second, the FOR codec speeds up substantially, because I'm able (hacked up though) do a no-copy decode with MMapDirectory.

Remember that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!), that doesn't pass all Lucene's unit tests, etc. There are also still further optimizations to explore, and lots of work remains. So it's not clear where the final numbers will be, but these initial results are very encouraging!

If you have any other algorithms to try for encoding ints, pleaes pipe up! It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.
References
Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)