Building a New Lucene Postings Format
A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.
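To make the "mix of formats" idea concrete, here is a toy model in plain Java. It is only a sketch: ToyCodec, IndexPart and withFormat are hypothetical names invented for illustration, not Lucene's Codec API, and the enum lists only the index parts named above rather than all eight.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: hypothetical types, NOT Lucene's Codec API.
// Only the index parts named in the text are listed, not all eight.
enum IndexPart { POSTINGS, STORED_FIELDS, TERM_VECTORS, NORMS }

final class ToyCodec {
  final String name;
  private final Map<IndexPart, String> formats;

  private ToyCodec(String name, Map<IndexPart, String> formats) {
    this.name = name;
    this.formats = formats;
  }

  // A base codec that uses one family of formats for every index part.
  static ToyCodec uniform(String name) {
    Map<IndexPart, String> m = new EnumMap<>(IndexPart.class);
    for (IndexPart part : IndexPart.values()) {
      m.put(part, name + ":" + part);
    }
    return new ToyCodec(name, m);
  }

  // Derive a new codec that overrides a single part and reuses the rest.
  ToyCodec withFormat(String newName, IndexPart part, String format) {
    Map<IndexPart, String> m = new EnumMap<>(formats);
    m.put(part, format);
    return new ToyCodec(newName, m);
  }

  String formatFor(IndexPart part) {
    return formats.get(part);
  }
}

class ToyCodecDemo {
  public static void main(String[] args) {
    ToyCodec base = ToyCodec.uniform("Lucene40");
    // Swap in a custom term vectors format; everything else is inherited.
    ToyCodec mine = base.withFormat("MyCodec", IndexPart.TERM_VECTORS, "MyTermVectorsFormat");
    for (IndexPart part : IndexPart.values()) {
      System.out.println(part + " -> " + mine.formatFor(part));
    }
  }
}
```

The point of the sketch is the shape of the design: a new codec only has to name the parts it changes, and every other part falls through to the codec it was derived from.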
The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with your new postings format selected (be sure to first register it). If your new postings format has a bug, tests will most likely fail.
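As a rough illustration of why skipping and partial consumption are bug-prone, and why randomized checks catch such bugs, here is a self-contained sketch. ToyDocsEnum and SkippingCheck are hypothetical classes written for this example, not Lucene's enumeration API; the skip logic simply jumps ahead a fixed number of entries at a time, which is an assumption standing in for a real skip list.

```java
import java.util.Random;

// Toy stand-in for a postings enumeration; hypothetical, not a Lucene class.
final class ToyDocsEnum {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs;        // sorted, strictly increasing doc IDs
  private final int skipInterval;  // how many entries one skip jumps over
  private int pos = -1;

  ToyDocsEnum(int[] docs, int skipInterval) {
    this.docs = docs;
    this.skipInterval = skipInterval;
  }

  private int current() {
    return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
  }

  int nextDoc() {
    if (pos < docs.length) pos++;
    return current();
  }

  // Returns the first doc >= target that lies after the current position.
  int advance(int target) {
    int next = pos + 1;
    // Skip a whole run when even its last doc is still below the target.
    while (next + skipInterval <= docs.length && docs[next + skipInterval - 1] < target) {
      next += skipInterval;
    }
    // Finish with a linear scan inside the remaining run.
    while (next < docs.length && docs[next] < target) next++;
    pos = Math.min(next, docs.length);
    return current();
  }
}

final class SkippingCheck {
  // Randomly interleaves nextDoc() and advance(), compares each result with a
  // plain linear scan, and sometimes stops early (partial consumption).
  static boolean check(long seed) {
    Random r = new Random(seed);
    int n = r.nextInt(200);
    int[] docs = new int[n];
    int doc = -1;
    for (int i = 0; i < n; i++) {
      doc += 1 + r.nextInt(10);
      docs[i] = doc;
    }
    ToyDocsEnum e = new ToyDocsEnum(docs, 1 + r.nextInt(16));
    int expectedPos = -1;
    int last = -1;
    while (true) {
      int got;
      expectedPos++;
      if (r.nextBoolean()) {
        got = e.nextDoc();
      } else {
        int target = last + 1 + r.nextInt(30);
        got = e.advance(target);
        while (expectedPos < n && docs[expectedPos] < target) expectedPos++;
      }
      int expected = expectedPos < n ? docs[expectedPos] : ToyDocsEnum.NO_MORE_DOCS;
      if (got != expected) return false;
      if (got == ToyDocsEnum.NO_MORE_DOCS) return true;
      last = got;
      if (r.nextInt(50) == 0) return true;  // stop visiting documents early
    }
  }
}
```

Running SkippingCheck.check over many seeds exercises skip intervals, targets and early exits that a hand-written test would be unlikely to cover, which is the same idea the randomized tests apply at a much larger scale.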
However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.
I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.
So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose output, the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.
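The layering idea can be sketched in a few lines of plain Java. This is not how TestPostingsFormat is implemented; Layer, ToyPosting, ToyReader, LayeredCheck and DocsOnlyReader are hypothetical names made up for the example, and only the first three layers (documents, freqs, positions) are modeled to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layers, in the order they are enabled while iterating.
enum Layer { DOCS, FREQS, POSITIONS }

// Expected postings for one document of a single term (toy data holder).
final class ToyPosting {
  final int doc;
  final int[] positions;
  ToyPosting(int doc, int... positions) {
    this.doc = doc;
    this.positions = positions;
  }
  int freq() { return positions.length; }
}

// What a format under development has to expose in this toy setup.
interface ToyReader {
  int docCount();
  int doc(int i);
  int freq(int i);
  int position(int i, int j);
}

final class LayeredCheck {
  // Verifies only the layers up to maxLayer, so a half-built format can pass.
  static List<String> verify(ToyPosting[] expected, ToyReader actual,
                             Layer maxLayer, boolean verbose) {
    List<String> failures = new ArrayList<>();
    if (actual.docCount() != expected.length) {
      failures.add("doc count: expected " + expected.length + " but got " + actual.docCount());
      return failures;
    }
    for (int i = 0; i < expected.length; i++) {
      ToyPosting p = expected[i];
      if (verbose) {
        System.out.println("checking doc=" + p.doc + " freq=" + p.freq() + " up to " + maxLayer);
      }
      if (actual.doc(i) != p.doc) failures.add("doc mismatch at index " + i);
      if (maxLayer.compareTo(Layer.FREQS) >= 0 && actual.freq(i) != p.freq()) {
        failures.add("freq mismatch for doc " + p.doc);
      }
      if (maxLayer.compareTo(Layer.POSITIONS) >= 0) {
        for (int j = 0; j < p.freq(); j++) {
          if (actual.position(i, j) != p.positions[j]) {
            failures.add("position " + j + " mismatch for doc " + p.doc);
          }
        }
      }
    }
    return failures;
  }
}

// First iteration of a format: documents work, nothing else exists yet.
final class DocsOnlyReader implements ToyReader {
  private final int[] docs;
  DocsOnlyReader(int... docs) { this.docs = docs; }
  public int docCount() { return docs.length; }
  public int doc(int i) { return docs[i]; }
  public int freq(int i) { throw new UnsupportedOperationException("freqs not implemented yet"); }
  public int position(int i, int j) { throw new UnsupportedOperationException("positions not implemented yet"); }
}
```

With this shape, LayeredCheck.verify(expected, new DocsOnlyReader(3, 8), Layer.DOCS, true) can pass on day one, and raising the layer to FREQS or POSITIONS turns on the next batch of checks once that part of the format exists.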
The goal of this test is to be so thorough that if it passes with your postings format then all of Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)
If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.