Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer on the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been.

Solr Powered ISFDB – Part #11: Using DisMax

11.28.2011
This is Part 11 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

When we left off last time, we had used a domain specific biasing function to improve the order of our results so popular Authors and Titles surfaced at the top of results. Today we’re going to look at using DisMax to make further improvements.

(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_10 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_11 tag containing the end result of this article.)

Popular != What I Want

Using a score boost based on popularity gave us some quick wins in making “good” docs bubble up easily, and it’s the type of solution Product Managers and Sales folks really love because it shows the “hot” stuff front and center, but it can also annoy users who are interested in the “long tail”. Sometimes, they may not even be looking for the tip of that tail — take for instance an author search for Sterling.

Bruce Sterling is a popular Sci-Fi author who has published almost 200 novels/stories, and anyone searching the ISFDB Data would be reasonable in expecting his name to be the first result for “Sterling”. Since we’ve got a filter on doc_type:AUTHOR, you would certainly expect him to be at the top of a list of folks named Sterling.

Instead what we get on our page #1 of results is…

  1. Ray Bradbury
  2. Bruce Sterling
  3. Gregory Benford
  4. Edmond Hamilton
  5. Terry Brooks
  6. Sterling E. Lanier
  7. Amy Sterling Casil
  8. William Morrison
  9. Sterling Lanier
  10. Kenneth Sterling

…there’s hardly a “Sterling” among them!

The reason is simple and straightforward, and somewhat clear just from the UI view. We can see that “Ray Bradbury” has a pseudonym of “Brett Sterling”, so it’s not a big stretch to imagine that he might be more popular than “Bruce Sterling”, and the explain toggle shows us that this is in fact the case…

  • Ray Bradbury
    451896.44 = (MATCH) boost(catchall:sterling,sum(int(views),int(annualviews))), product of:
      8.268163 = (MATCH) weight(catchall:sterling in 560416), product of:
        0.99999994 = queryWeight(catchall:sterling), product of:
          8.268164 = idf(docFreq=443, maxDocs=636658)
          0.12094583 = queryNorm
        8.268164 = (MATCH) fieldWeight(catchall:sterling in 560416), product of:
          1.0 = tf(termFreq(catchall:sterling)=1)
          8.268164 = idf(docFreq=443, maxDocs=636658)
          1.0 = fieldNorm(field=catchall, doc=560416)
      54655.0 = sum(int(views)=40015,int(annualviews)=14640)
    
  • Bruce Sterling
    327739.88 = (MATCH) boost(catchall:sterling,sum(int(views),int(annualviews))), product of:
      18.488174 = (MATCH) weight(catchall:sterling in 560504), product of:
        0.99999994 = queryWeight(catchall:sterling), product of:
          8.268164 = idf(docFreq=443, maxDocs=636658)
          0.12094583 = queryNorm
        18.488176 = (MATCH) fieldWeight(catchall:sterling in 560504), product of:
          2.236068 = tf(termFreq(catchall:sterling)=5)
          8.268164 = idf(docFreq=443, maxDocs=636658)
          1.0 = fieldNorm(field=catchall, doc=560504)
      17727.0 = sum(int(views)=12092,int(annualviews)=5635)
    



Looking at the other results and their score explanations, it’s easy to see pseudonyms affecting the other results in the same way (or in the case of Terry Brooks: the birth place of “Sterling, Illinois”).
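To make the arithmetic in those explanations concrete, here is a minimal Python sketch that recomputes both scores from the numbers shown above. It assumes the classic Lucene TF-IDF formulas (tf is the square root of the term frequency, and the text score is queryWeight times fieldWeight); the function names are mine, not anything from Solr.

```python
import math

def text_score(term_freq, idf, query_norm, field_norm):
    # queryWeight and fieldWeight as they appear in the explain output.
    query_weight = idf * query_norm
    field_weight = math.sqrt(term_freq) * idf * field_norm
    return query_weight * field_weight

def boosted_score(term_freq, idf, query_norm, field_norm, views, annual_views):
    # The multiplicative boost: text score times sum(views, annualviews).
    popularity = views + annual_views
    return text_score(term_freq, idf, query_norm, field_norm) * popularity

IDF = 8.268164          # idf(docFreq=443, maxDocs=636658)
QUERY_NORM = 0.12094583

bradbury = boosted_score(1, IDF, QUERY_NORM, 1.0, 40015, 14640)
sterling = boosted_score(5, IDF, QUERY_NORM, 1.0, 12092, 5635)

# Small differences from the explain output are float32 rounding in Lucene.
print(f"Ray Bradbury:   {bradbury:.2f}")
print(f"Bruce Sterling: {sterling:.2f}")
```

Bruce Sterling’s text score is roughly 2.2 times Ray Bradbury’s, but Bradbury’s popularity sum is roughly 3.1 times larger, so the product still favors Bradbury.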

Not All Fields Are Created Equal

It would be easy to fall into a trap of micro-tuning a divisor on the popularity boost to try and make it more subtle, but ultimately the problem is that we are searching against a “catchall” field containing all of the text from all of the other fields, and in reality not all fields are created equal. Bruce Sterling may have the term “Sterling” in his catchall field 5 times compared to Ray Bradbury’s 1, but what should really matter is which fields the term appears in. We could change our catchall field to only include the canonical name of an author instead of their pseudonyms, but that’s a very black/white solution that would hurt folks searching on pseudonyms (or looking for authors from Illinois). What we need is a shade of grey that lets us give more weight to some fields than others.

Enter DisMax.

DisMax is a QParser that I’ve written about before. If you want all the gory details, I suggest you read that article, but for now the quick takeaway is that DisMax lets you configure different fields to search against with different weights.
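As a rough intuition for what those weights do, here is a toy Python model of a “disjunction max” score: each field’s score is multiplied by its weight, the best one wins, and an optional tie breaker adds a fraction of the rest. This is only a sketch; real Lucene scoring also involves query normalization, and the per-field scores and weights below are made-up numbers for illustration.

```python
def dismax_score(field_scores, qf_weights, tie=0.0):
    """Toy disjunction-max: best weighted field score, plus tie * the others."""
    weighted = [score * qf_weights[field]
                for field, score in field_scores.items()
                if field in qf_weights]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)

# Hypothetical weights: a canonical_name match counts 10x a catchall match.
QF = {"canonical_name": 10.0, "catchall": 1.0}

# Hypothetical raw per-field scores for a search on "sterling".
bruce = {"canonical_name": 1.0, "catchall": 2.2}  # name match plus catchall
ray = {"catchall": 1.0}                           # pseudonym only, via catchall

print(dismax_score(bruce, QF), dismax_score(ray, QF))
```

With weights like these, a match in the name field dominates no matter how many times the term shows up in the catchall field.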

To keep things simple to start, I’m going to ignore “Title” documents completely, and focus solely on “Author” docs (since different types of documents contain different fields). Without changing my configs at all, I can use URL params to experiment with some different uses of DisMax to search specific fields with various weightings…




(Note: in this last instance, we have to move the defType=dismax into the q param’s local params, so it will be used to pick the nested parser for v=$qq. defType is only the default type of parser for the “main” query at whatever level it’s used — it doesn’t recurse down to other query strings that get parsed)
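For readers without the original links handy, here is a hedged Python sketch of what parameter sets along these lines could look like. The base URL and the qf weights are assumptions for illustration, not the values used in the project; the doc_type:AUTHOR filter, the qq param, and the views/annualviews boost come from earlier in this series.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # assumed local Solr endpoint

def author_search_url(user_query, qf, boosted=False):
    """Build a DisMax author search URL, optionally wrapped in the boost."""
    params = {"fq": "doc_type:AUTHOR"}
    if boosted:
        # defType lives in the local params so it picks the parser for v=$qq.
        params["q"] = "{!boost b=sum(views,annualviews) defType=dismax v=$qq}"
        params["qq"] = user_query
    else:
        # Plain DisMax: defType applies to the main q param.
        params["defType"] = "dismax"
        params["q"] = user_query
    params["qf"] = qf
    return BASE_URL + "?" + urlencode(params)

# Hypothetical weights: favor the canonical name, still match anywhere.
simple_url = author_search_url("sterling", "canonical_name^10 catchall")
boosted_url = author_search_url("sterling", "canonical_name^10 catchall",
                                boosted=True)
print(simple_url)
print(boosted_url)
```

In the boosted form there is no top-level defType at all, which is exactly the situation the note above describes.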

We’ve now got some results that look fairly decent: matches in the canonical_name field are heavily weighted and considered really important, but matches anywhere in the document will still be returned as results. In the future we might want to better leverage the pf param of DisMax to only weight fields heavily if they contain all of the terms in a query, but for now we’ve definitely got some incremental improvement.

But What About Titles?

Before we call it a day, we have to think about the “Title” situation. We’re still searching the catchall field, so matching titles are still being returned, but since they don’t have a chance of matching any of the heavily weighted fields, the scores from DisMax can be so low that even extremely popular titles will score lower than authors who just happen to have names that are similar to their titles. I’m sure Pete Lion worked very hard on the cover art for the one book he worked on, but does it really make sense that a search for lion should return him before The Lion, the Witch and the Wardrobe (the most popular title in the ISFDB)?

One approach we could take would be to use copyField directives or DIH transformers to create more “common” fields that would exist in all types of documents, and use those in our DisMax options. I may do that down the road, but in the meantime we can gain parity for Title documents by adding the title field to the qf with a comparable boost to canonical_name, so “good” matches on Title docs will get decent scores.
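A small Python sketch of why this restores parity: it parses a DisMax-style qf string and reports the highest weight each document type can possibly match. The weights and the field sets per document type are hypothetical simplifications (real Author and Title docs have more fields than this).

```python
def parse_qf(qf):
    """Parse a 'field^boost field ...' string into {field: boost}."""
    weights = {}
    for token in qf.split():
        field, _, boost = token.partition("^")
        weights[field] = float(boost) if boost else 1.0
    return weights

def best_possible_weight(doc_fields, qf):
    """Highest qf weight among the fields this document type actually has."""
    weights = parse_qf(qf)
    return max((w for f, w in weights.items() if f in doc_fields), default=0.0)

# Simplified, assumed field sets for the two document types.
AUTHOR_FIELDS = {"canonical_name", "catchall"}
TITLE_FIELDS = {"title", "catchall"}

QF_BEFORE = "canonical_name^10 catchall"          # hypothetical weights
QF_AFTER = "canonical_name^10 title^10 catchall"  # title added for parity

for qf in (QF_BEFORE, QF_AFTER):
    print(qf, best_possible_weight(AUTHOR_FIELDS, qf),
          best_possible_weight(TITLE_FIELDS, qf))
```

Before the change, a Title doc can only ever match at the catchall weight; afterwards both document types can reach the same top weight.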

Last But Not Least: Fix Some UI Bugs

When I added the multiplicative boost last week, and switched to using qq as the main query param, I configured the boosted q param as an “invariant” so that it would always be applied and could never be overridden. This works well, and I updated the text-box in the UI to know about qq but I forgot to do something about the “See All Titles”, “Pseudonyms: “, “Real Name: ” and “Author: ” links that are created in the results. They still try to specify a q param, which gets ignored.

Since I’m using DisMax as my QParser for the qq param, just doing a search and replace of “q=” to “qq=” won’t really work here (it probably could if I used “edismax”, but that’s a topic for a different day). Instead what I’m going to do is what I probably should have done in the first place when I added these links: Use fq for these links, and rely on the default query (now specified for DisMax using q.alt)
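As an illustration of that approach, here is a minimal Python sketch of building such a link with only an fq param. The field name author_id and the relative URL are hypothetical stand-ins (the real field names live in the project’s schema), and the sketch assumes the q.alt default is supplied by the request handler config rather than by the link.

```python
from urllib.parse import urlencode

def filter_link(base_url, field, value):
    """Build a results link that only filters; no q or qq is sent."""
    # With no user query, DisMax is expected to fall back to the q.alt
    # default from the config, and fq narrows that down to matching docs.
    params = {"fq": f"{field}:{value}"}
    return base_url + "?" + urlencode(params)

# Hypothetical "See All Titles" link for an author whose id is 123.
titles_link = filter_link("/solr/select", "author_id", 123)
print(titles_link)
```

Because nothing in the link tries to set q, there is nothing for the invariant to silently ignore.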

The one other “Bug” I wanted to fix today is the bug in my brain that somehow let me get this far in working on this project without ever adding an external link from each search result to the main ISFDB.org page for the specific Author/Title. I’m not sure why I never did it before, but it was a relatively simple little bit of UI markup (although it did require a small macro change because of some oddities in whitespace handling).

Conclusion (For Now)

And that wraps up this latest installment with the blog_11 tag. We’ve now got some much better looking results for various searches, by using DisMax to search against various fields with different weighted importance.

One final note: It’s important to realize that there is nothing special about the weights I picked for these fields. They are not magic numbers, I did not put a lot of thought into them, and I didn’t rely on any particular wisdom or experience (that I didn’t share in this article) to decide what they should be. I just picked numbers that at first glance gave me good looking results. The scores produced don’t matter; what matters is that the weights used in the qf “play nicely” with one another, and with the multiplicative boost from the popularity. If the popularity numbers grow by a few orders of magnitude, then these numbers might not be useful anymore. In an ideal world, I would set up a suite of relevancy tests, do click-through analysis, and have a team of helper monkeys sanity checking popular searches, but for a one man personal project, the results so far seem pretty good.


Source: http://www.lucidimagination.com/blog/2011/08/08/solr-powered-isfdb-part-11/
Published at DZone with permission of its author, Chris Hostetter.
