
Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer on the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been.

Solr Powered ISFDB – Part #10: Tweaking Relevancy

11.27.2011
This is Part 10 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

Circumstances have conspired to keep me away from this series longer than I had intended, so today I want to jump right in and talk about improving the user experience by improving relevancy.

(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_9 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_10 tag containing the end result of this article.)

Academic vs Practical

In Academia, people who study IR have historically discussed “relevancy” in terms of “Precision vs Recall” (if these terms aren’t familiar to you, then I highly suggest reading the link), but in my experience, those kinds of metrics are just the starting point. While users tend to care that your “Recall” is good (results shouldn’t be missing), “Precision” is usually less important than “ordering”: most users (understandably) want the “best” results to come first, and don’t care about the total number of results.
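For readers who don't have the definitions handy, here is a minimal sketch of the two metrics in Python. The document ids are made up purely for illustration. Notice that neither number says anything about the order of the results, which is the gap discussed above.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    precision = fraction of returned docs that are relevant
    recall    = fraction of relevant docs that were returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, not taken from the ISFDB index.
returned_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7"]

precision, recall = precision_recall(returned_ids, relevant_ids)
print("precision=%.2f recall=%.2f" % (precision, recall))
```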

Defining the “best” results is where things get tricky. Once again, there are lots of great algorithms out there whose pros and cons academics debate all the time, but frequently the best approach you can take to give your users the “best” results first isn’t to get a PhD in IR, it’s to “cheat”: bias the algorithms by applying “Domain Specific Knowledge”. But I’m getting ahead of myself, so let’s start with a real example.

Poor Results in Our Domain

Every domain is different, and the key to providing a good search experience is making sure you really understand your domain, and how your users (and data) relate to it.

Let’s look at a specific example with our ISFDB Data. One of the most famous Sci-Fi short stories ever written is Nightfall by Isaac Asimov, who later collaborated with Robert Silverberg to expand it into a novel. If a user (who knows they are searching the ISFDB) searched for the word “Nightfall” they would understandably expect one of those two titles to appear fairly high up in the list of results, but that’s not quite what they get with page #1 of our search as it’s configured right now…


  1. Title: Nightfall INTERIORART – Author: Kolliker
  2. Title: Nightfall: Body Snatchers ANTHOLOGY- Author: uncredited
  3. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  4. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  5. Title: Nightfall SHORTFICTION – Author: Tom Chambers
  6. Title: Glossary (Nightfall at Algemron) ESSAY – Author: uncredited
  7. Title: Cover: Nightfall One COVERART – Author: Ken Sequin
  8. Title: Cover: Nightfall Two COVERART – Author: Ken Sequin
  9. Title: Nightfall POEM – Author: Susan A. Manchester
  10. Title: Cover: Nightfall One COVERART – Author: Ken Sequin

These results aren’t terribly surprising, since so far in this series we’ve put no work into relevancy tuning: we’re just searching a simple “catchall” field. Before we can improve the situation, it’s important to make sure we understand why we’re getting what we’re getting and why we’re not getting what we want.

Score Explanations

One of the hardest features of Solr to understand is “Score Explanation”, not because it’s hard to use, but because the output really assumes you understand the core underpinnings of Lucene/Solr scoring. When we enable debugging on our query we get new “toggle explain” links for each result that let us see the score and a breakdown of how that score was computed, but that doesn’t let us compare with documents that aren’t on page #1. To do that, we use the explainOther option, and switch to the XML view since the velocity templates don’t currently display explainOther info. Now we can compare the explanations between the two docs we really hoped to find, and the top scoring result…

  • TITLE_847094 (Nightfall INTERIORART)
    2.442217 = (MATCH) fieldWeight(catchall:nightfall in 274241), product of:
      <b>1.0</b> = tf(termFreq(catchall:nightfall)=<b>1</b>)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      <b>0.25</b> = fieldNorm(field=catchall, doc=274241)
    
  • TITLE_11852 (Nightfall NOVEL)
    1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 11741), product of:
      <b>1.4142135</b> = tf(termFreq(catchall:nightfall)=<b>2</b>)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      <b>0.125</b> = fieldNorm(field=catchall, doc=11741)
    
  • TITLE_46434 (Nightfall SHORTFICTION)
    1.7269082 = (MATCH) fieldWeight(catchall:nightfall in 41784), product of:
      <b>1.4142135</b> = tf(termFreq(catchall:nightfall)=<b>2</b>)
      9.768868 = idf(docFreq=98, maxDocs=636658)
      <b>0.125</b> = fieldNorm(field=catchall, doc=41784)
    



The devil is in the differences, which I’ve put in bold. Without going into a lot of complicated explanation, the crux of the issue is that even though the documents we’re looking for match the word “nightfall” twice in the catchall field we’re searching (and the top scoring result only matches once), that advantage is more than offset by the “fieldNorm”, which reflects the fact that the catchall field is much longer for our “good” docs than for our “bad” docs.
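The numbers in those explanations can be reproduced by hand. The sketch below assumes the classic Lucene default similarity formulas (tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1))). The function names are mine, and the fieldNorm values are copied straight from the explain output rather than recomputed.

```python
import math


def tf(term_freq):
    # Term frequency factor: square root of the raw match count.
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency: rarer terms score higher.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # The "product of" shown in each explanation above.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


DOC_FREQ, MAX_DOCS = 98, 636658  # from the explain output

# TITLE_847094: one match, short catchall field (fieldNorm 0.25)
interior_art = field_weight(1, DOC_FREQ, MAX_DOCS, 0.25)

# TITLE_11852: two matches, longer catchall field (fieldNorm 0.125)
asimov_novel = field_weight(2, DOC_FREQ, MAX_DOCS, 0.125)

print(round(interior_art, 4), round(asimov_novel, 4))
```

Doubling the number of matches only multiplies the score by about 1.41, while halving the fieldNorm cuts it in half, so the longer document loses. Lucene does this arithmetic in single precision, so the last digits may differ slightly from the explain output.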

Tweaking Our Scoring

This is one of those examples where academic theory doesn’t always match the reality of your domain. Typically when using the TF/IDF scoring model used in Lucene/Solr, you need a “length normalization” factor to offset the common case where a really long document inherently contains more words, so there is a statistical likelihood that the search terms may appear more times. In a nutshell: all other things being equal, shorter is better. This reasoning is generally sound, but the default implementation in Lucene/Solr can be a hindrance in a few common cases:

  • A corpus full of really short documents – our ISFDB index isn’t full books, just a bunch of metadata fields
  • A corpus where longer really is better – in the ISFDB data, more popular titles/authors tend to have more data, which means the catchall field is naturally longer.
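To make “shorter is better” concrete, here is a rough sketch of the default length normalization, assuming the classic Lucene formula of 1 / sqrt(number of terms in the field) and ignoring index-time boosts. The field lengths of 16 and 64 terms are illustrative values chosen because they produce the fieldNorm values seen in the explanations; I have not checked the real lengths, and Lucene stores norms in a single lossy byte, so a range of lengths maps to each stored value.

```python
import math


def length_norm(num_terms):
    # Default length normalization: longer fields get a smaller norm.
    return 1.0 / math.sqrt(num_terms)


short_field = length_norm(16)  # e.g. a title with very little metadata
long_field = length_norm(64)   # e.g. a popular title with lots of metadata

print(short_field, long_field)
```

A field four times as long gets half the norm, which is exactly the penalty our popular (and therefore metadata-rich) titles are paying.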

There are some cool things we could do with tweaking the Similarity class to try and improve this, but the simplest thing to start with is to set omitNorms on the catchall field to eliminate this factor from our scoring. With our new schema, we re-index and see some noticeable changes…

  1. Title: Nightfall NOVEL – Authors: Robert Silverberg, Isaac Asimov
  2. Title: Nightfall and Other Stories COLLECTION – Author: Isaac Asimov
  3. Title: Nightfall SHORTFICTION – Author: Isaac Asimov
  4. Title: The Legend of Nightfall NOVEL – Author: Mickey Zucker Reichert
  5. Title: Nightfall NOVEL – Author: John Farris
  6. Title: The Road to Nightfall COLLECTION – Author: Robert Silverberg
  7. Title: Road to Nightfall SHORTFICTION – Author: Robert Silverberg
  8. Title: A Tiger at Nightfall SHORTFICTION – Author: Harlan Ellison
  9. Title: Nightfall SHORTFICTION – Author: David Weber
  10. Title: Nightfall SHORTFICTION – Author: Charles Stross

Domain Specific Biases

Omitting length norms has helped “level the field” for our docs, and in this one example it looks like a huge improvement at first glance, but that’s mainly a fluke. If you look at the score explanations now we get a lot of identical scores, and the final ordering is primarily because of the order they were indexed in.
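Here is a small sketch of why those ties appear. It assumes that with norms omitted the fieldNorm factor is effectively 1.0, so the score collapses to tf × idf, and that tied scores fall back to internal document order. The doc ids and match counts are reused from the earlier explain output purely for illustration; they may well differ after a re-index.

```python
import math

# idf for "nightfall", using the docFreq/maxDocs from the explain output.
IDF = 1.0 + math.log(636658 / (98 + 1))


def score_without_norms(term_freq):
    # No length normalization: only the match count and term rarity matter.
    return math.sqrt(term_freq) * IDF


# (internal doc id, number of "nightfall" matches in catchall)
docs = [(274241, 1), (11741, 2), (41784, 2)]

# Highest score first; ties are broken by doc id, i.e. index order.
ranked = sorted(docs, key=lambda d: (-score_without_norms(d[1]), d[0]))
ranked_ids = [doc_id for doc_id, _ in ranked]

print(ranked_ids)
```

Every document with the same number of matches now gets exactly the same score, so nothing but indexing order separates the Asimov titles from any other “Nightfall” with two matches.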

This is where adding some Domain Specific Bias can be handy. If we review our schema, we see the views and annualviews fields, which correspond to how many page views a given author/title has received (recently) on the ISFDB web site. By factoring these page view counts into our scoring, we provide some “Document Biasing” to ensure that documents which are more popular will “win” (ie: score higher) in the event of a tie on the basic relevancy score.

The most straightforward way to bias scoring is with the BoostQParser, which will multiply the score of a query for each document by an arbitrary function (on that document). In its simplest form we can use it directly in our q param to multiply the scores by the simple sum of the two “views” fields: q={!boost b=sum(views,annualviews)}nightfall and now we get a much more interesting ordering…

  1. Title: Nightfall SHORTFICTION – Author: Isaac Asimov
  2. Title: Nightfall NOVEL – Authors: Robert Silverberg, Isaac Asimov
  3. Title: Nightfall and Other Stories COLLECTION – Author: Isaac Asimov
  4. Title: Nightfall SHORTFICTION – Author: Charles Stross
  5. Title: The Return: Nightfall NOVEL – Author: L. J. Smith
  6. Title: Nightfall SHORTFICTION – Author: Arthur C. Clarke
  7. Title: Nightfall SHORTFICTION – Author: David Weber
  8. Title: The Road to Nightfall COLLECTION – Author: Robert Silverberg
  9. Title: The Legend of Nightfall NOVEL – Author: Mickey Zucker Reichert
  10. Title: Nightfall Revisited ESSAY – Authors: Pat Murphy, Paul Doherty
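For anyone trying the boosted query outside the browser, the sketch below builds the request URL and mimics the arithmetic the boost applies. The base URL (a local Solr at http://localhost:8983/solr/select) and the view counts are assumptions for illustration only; the q value is the one shown above.

```python
from urllib.parse import urlencode, parse_qs

BOOST_FUNCTION = "sum(views,annualviews)"


def boosted_query(user_text):
    # Wrap the user's text in the boost parser's local params.
    return "{!boost b=%s}%s" % (BOOST_FUNCTION, user_text)


def select_url(base_url, user_text):
    # urlencode takes care of escaping the braces, "!" and "=".
    return base_url + "?" + urlencode({"q": boosted_query(user_text)})


def boosted_score(base_score, views, annualviews):
    # What the boost does per document: multiply by the function value.
    return base_score * (views + annualviews)


# Hypothetical host and port.
url = select_url("http://localhost:8983/solr/select", "nightfall")
print(url)

# Two docs tied on relevancy, separated by made-up view counts.
popular = boosted_score(13.8, views=900, annualviews=100)
obscure = boosted_score(13.8, views=3, annualviews=1)
print(popular > obscure)
```

One consequence of a purely multiplicative boost is worth keeping in mind: in this arithmetic, a document whose two view counts are both zero ends up with a score of zero.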

This new ordering for the page #1 results is much more appropriate for the domain of the ISFDB, and represents a general rule of relevancy biasing: “Users usually want to see the popular stuff.” However, users don’t usually want to have to type things like {!boost b=sum(views,annualviews)}... into the search box, so we need to encapsulate this into our config. It’s very easy to do this using Local Params, but unfortunately it does mean changing our “main” query param from q to something else.

We start by changing the defaults and invariants of our request handler so that our boost function is always used as the q param, but it uses a new qq param as the main query (whose score will be multiplied by the function). This works fine for our default query, but in order to be useful our UI also needs to be changed to know that the qq param is what is now used for the user input.
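The way the handler settings combine with an incoming request can be sketched as a simple merge: defaults are used only when the request doesn’t supply a value, while invariants always win. The exact values below are my assumptions rather than a copy of the committed config, in particular the v=$qq parameter dereference and the match-all default for qq.

```python
def effective_params(request, defaults, invariants):
    """Merge params the way a request handler would:
    defaults < request params < invariants."""
    merged = dict(defaults)
    merged.update(request)
    merged.update(invariants)
    return merged


# Assumed handler config: q is fixed, and it points at qq for the user input.
INVARIANTS = {"q": "{!boost b=sum(views,annualviews) v=$qq}"}
DEFAULTS = {"qq": "*:*"}

# The UI now sends the user's text as qq; any q it sends is overridden.
from_ui = effective_params({"qq": "nightfall", "q": "ignored"}, DEFAULTS, INVARIANTS)

# No user input at all: the default qq is used.
no_input = effective_params({}, DEFAULTS, INVARIANTS)

print(from_ui)
print(no_input)
```

The point of making q an invariant is that a client can never bypass the boost, while keeping qq as a default leaves the landing page working before the user has typed anything.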

Conclusion (For Now)

And that wraps up this latest installment with the blog_10 tag. We’ve dramatically improved the user experience by tweaking how our relevancy scores are computed based on some knowledge of our domain, particularly via Document Biases. In my next post, I hope to continue the topic of improving the user experience by using DisMax to add “Field Biases”.



Source: http://www.lucidimagination.com/blog/2011/06/20/solr-powered-isfdb-part-10/
Published at DZone with permission of its author, Chris Hostetter.


Comments

Amara Amjad replied on Sun, 2012/03/25 - 12:58am

Hi,

Thanks a lot for the interesting articles!

I just finished reading the first article and am very interested in learning how you achieved fetching the data from ISFDB into mysql.

Thanks a lot,
