Heavy Committing: Flexible Indexing in Lucene 4
Apache Lucene's next major release, Lucene 4, will introduce a great deal of flexibility into indexing, along with fundamental changes to the well-known APIs. It features a new, consistent, 4-dimensional iteration API (fields, terms, documents, positions) on top of a low-level, pluggable codec API that gives applications full control over the postings data. Terms are now arbitrary opaque bytes, so users can store terms natively in the index in any encoding, not necessarily UTF-8 (numeric fields are one example). A higher-performance postings iteration API is currently under development; it will allow codecs based on recent encoding algorithms to work effectively. Several codecs have already been created, including the default "standard" codec, which enables a sizable RAM reduction for searchers, and a "pulsing" codec, which inlines postings data directly into the terms dictionary and provides a solid performance boost for primary key fields. Many more codecs are under development. In this talk, Uwe presents an overview of all of these changes, as well as several concrete, real-world examples of how applications can tap into the new features.
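To make the four levels of iteration and the idea of byte-valued terms more concrete, here is a minimal, self-contained toy model in plain Java. It is not Lucene's API: the class PostingsWalkSketch and its methods add, walk and intTerm are hypothetical names invented for this sketch, and the sign-flipped big-endian encoding of an int is only one simple way to make numeric order agree with byte order, assumed here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model (NOT Lucene's API): field -> term bytes -> doc id -> positions.
class PostingsWalkSketch {
    // Terms are opaque bytes, so they are ordered by unsigned byte value, not as strings.
    private static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    };

    private final Map<String, TreeMap<byte[], TreeMap<Integer, List<Integer>>>> fields = new TreeMap<>();

    // Records one occurrence of a term at a position in a document.
    void add(String field, byte[] term, int docId, int position) {
        fields.computeIfAbsent(field, f -> new TreeMap<byte[], TreeMap<Integer, List<Integer>>>(UNSIGNED_BYTES))
              .computeIfAbsent(term, t -> new TreeMap<Integer, List<Integer>>())
              .computeIfAbsent(docId, d -> new ArrayList<Integer>())
              .add(position);
    }

    // Hypothetical numeric encoding: flip the sign bit and write big-endian,
    // so that unsigned byte order matches signed int order.
    static byte[] intTerm(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16), (byte) (flipped >>> 8), (byte) flipped
        };
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    // Walks all four dimensions in order: fields, terms, documents, positions.
    List<String> walk() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, TreeMap<byte[], TreeMap<Integer, List<Integer>>>> field : fields.entrySet()) {
            for (Map.Entry<byte[], TreeMap<Integer, List<Integer>>> term : field.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + " " + hex(term.getKey())
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PostingsWalkSketch index = new PostingsWalkSketch();
        index.add("body", "lucene".getBytes(StandardCharsets.UTF_8), 1, 0);
        index.add("id", intTerm(5), 3, 0);
        index.add("id", intTerm(-1), 7, 0);
        for (String line : index.walk()) System.out.println(line);
    }
}
```

In this sketch the "id" field holds binary, non-UTF-8 terms, and the term for -1 is visited before the term for 5 because the comparison is done on raw bytes. A real codec would store and decode these structures very differently; the nesting of the four loops is the only point being illustrated.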