Robert Muir is a software engineer for Lucid Imagination and a Lucene/Solr committer and PMC member.

Flexible ranking in Lucene 4

10.11.2011

Over the summer I served as a Google Summer of Code mentor for David Nemeskey, a PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and to implement some state-of-the-art ranking models with the new framework.

These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc.) and queries (term, phrase, spans, etc.). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.

Relevance ranking is the heart of a search engine, and I hope the additional models and flexibility will improve the experience of using Lucene, whether you have been frustrated with tuning TF/IDF weights and find that an alternative model works better for your case, have found it difficult to integrate custom logic that your application needs, or just want to experiment.

I’ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at Lucene Eurocon in October.

Some bullet points of the new scoring features:

  • New ranking algorithms, in addition to Lucene’s Vector Space Model:
  • Added key statistics to the index format to support additional scoring models.
    • Term- and field-level statistics for collection frequencies and deriving averages.
    • Additional document-level statistics for computing normalization factors.
  • Decoupled matching from ranking in Lucene’s core search classes:
  • Powerful low-level Similarity API, supporting advanced use cases:
    • Incorporate per-document values from Column Stride Fields into the score.
    • Use different scoring parameters or algorithms for different fields.
    • Fuse multiple scoring algorithms into a combined score.
  • Convenient high-level SimilarityBase for everything else:
    • Write your own scoring function in one Java method.
    • Easy access to available index statistics.
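
To make the role of the new index statistics concrete, here is a minimal sketch of a BM25-style scoring function in plain Java. It is not Lucene's implementation and does not use the Similarity API. The class name, the `FieldStats` holder, and its field names are assumptions made for illustration, and `K1 = 1.2` and `B = 0.75` are just commonly used defaults. The point is that the model needs a field-level token total to derive an average length, plus a per-document length for normalization.

```java
// Illustrative sketch only: plain Java, not Lucene's Similarity API.
final class Bm25Sketch {
    // Commonly used BM25 defaults; both are free parameters you would tune.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Hypothetical holder for the field-level statistics the model needs. */
    static final class FieldStats {
        final long docCount;         // documents that contain the field
        final long sumTotalTermFreq; // total tokens in the field, all documents

        FieldStats(long docCount, long sumTotalTermFreq) {
            this.docCount = docCount;
            this.sumTotalTermFreq = sumTotalTermFreq;
        }

        /** Average field length, derived from the two collection totals. */
        double avgFieldLength() {
            return (double) sumTotalTermFreq / docCount;
        }
    }

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Scores one term in one document. */
    static double score(FieldStats stats, long docFreq, double termFreq, long docLength) {
        // Length normalization: documents longer than average are penalized.
        double norm = 1.0 - B + B * (docLength / stats.avgFieldLength());
        // Term frequency saturates instead of growing without bound.
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * norm);
        return idf(docFreq, stats.docCount) * tfPart;
    }

    public static void main(String[] args) {
        FieldStats body = new FieldStats(1000, 250000); // average length: 250 tokens
        System.out.println("average-length doc: " + score(body, 10, 3, 250));
        System.out.println("long doc:           " + score(body, 10, 3, 1000));
    }
}
```

With the same term frequency, the longer document scores lower than the average-length one. That is the kind of normalization the document-level statistics make possible.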

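The ideas of choosing a scoring algorithm per field and fusing several algorithms into one score can be sketched the same way. The `Scorer` interface, the `sum` combinator, and the `forField` lookup below are hypothetical names, not Lucene classes, and summing the individual scores is only one simple way to fuse them.

```java
// Illustrative sketch only: hypothetical types, not Lucene's Similarity classes.
final class ScoringCompositionSketch {
    /** One scoring function: term frequency and document length in, score out. */
    interface Scorer {
        double score(double termFreq, double docLength);
    }

    /** Fuses several scorers into one by summing their individual scores. */
    static Scorer sum(Scorer... scorers) {
        return (tf, len) -> {
            double total = 0.0;
            for (Scorer s : scorers) {
                total += s.score(tf, len);
            }
            return total;
        };
    }

    /** Picks the scorer registered for a field, or the fallback if there is none. */
    static Scorer forField(String field, java.util.Map<String, Scorer> perField, Scorer fallback) {
        return perField.getOrDefault(field, fallback);
    }

    public static void main(String[] args) {
        Scorer rawTf = (tf, len) -> tf;                // ignores document length
        Scorer lengthNormalized = (tf, len) -> tf / len; // favors short documents
        Scorer fused = sum(rawTf, lengthNormalized);

        java.util.Map<String, Scorer> perField = new java.util.HashMap<>();
        perField.put("title", rawTf); // short field: length matters less

        // "title" uses its own scorer; every other field uses the fused default.
        System.out.println(forField("title", perField, fused).score(4, 8));
        System.out.println(forField("body", perField, fused).score(4, 8));
    }
}
```

Keeping each scoring function behind one small interface is what makes both compositions cheap: a per-field choice is a map lookup, and a fused score is a loop over scorers.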

For more information about this GSoC project, take a look at its wiki page.

Published at DZone with permission of its author, Robert Muir. (source)
