
Coming from a background in aerospace engineering, John soon discovered that his true interest lay at the intersection of information technology and entrepreneurship (and, when applicable, math). In early 2011, John stepped away from his day job to take up software consulting. He eventually found permanent employment at Opensource Connections, where he currently consults with large enterprises on full-text search and Big Data applications. Highlights to this point have included prototyping the future of search with the US Patent and Trademark Office, implementing the search syntax used by patent examiners, and building a Solr search relevancy tuning framework called SolrPanl. John is a DZone MVB, is not an employee of DZone, and has posted 23 posts at DZone.

Modifying Solr Result Relevancy Via An “Auxiliary Boost” Field

04.12.2013

English is a confusing language. I mean, does it really make sense that you can park in a driveway or drive on a parkway? Also, I’ve always been amused that there actually exists a class of words that are their own antonyms, the so-called “auto-antonyms”:

cleave – 1] Split or sever (something) 2] Stick fast to

awful – 1] worthy of awe 2] very bad

to overlook – 1] to inspect 2] to fail to notice

Unfortunately, the confusing nature of English (and of all natural languages) sometimes has consequences that can affect our bottom line. Consider a situation that Zappos was facing some time back with their search results: if I am looking for a pair of “dress shoes”, then what should I expect to see?

You would expect that I would see a page full of brown or black leather shoes, right? Unfortunately, Solr had some different opinions. By default, the page was filled not only with dress shoes, but also with sundresses, tennis shoes, and dress pants! And in some ways it makes sense, right? Under the hood, Solr is really just a sophisticated and performant token-matching engine, so a document that matches either “dress” or “shoes” is a candidate result.

Fortunately for Zappos, much of their problem was alleviated by boosting phrase matches more heavily, so that if “dress” and “shoes” occurred next to each other in the text, then that document would rise toward the top. However, some e-commerce sites have a great deal of difficulty with this problem, and it drives them toward extreme and even somewhat detrimental approaches. For instance, some companies build in special-case solutions (band-aid solutions) so that if they see a particular query string, they completely circumvent their search engine and provide a hand-tailored set of results. This is a very brittle approach because with every update to the inventory, every new partnership, and every new advertising campaign, someone must review each of these band-aid fixes and make sure they are still relevant.

There’s a better approach, beautiful in its simplicity and its flexibility. Solr and ElasticSearch view each item in your inventory as a document made up of fields, each with its own value. So for Zappos, a document might contain a SKU, an item name, a brand name, a description, and a price. But there’s no reason that you can’t include additional fields that are used to modify the relevancy of a particular document in a particular search. We call these fields auxiliary boosting fields, and they work like this.

Consider again the dress shoes problem. If every document in your index has two additional fields, AuxiliaryBoost and AuxiliaryBust, then we can tightly control the search results and the way they are sorted. As a merchandizing expert, if you see a document that should not appear in the search results, a sundress for example, then you add the offending query string to the AuxiliaryBust field. Likewise, if you find a document that really should be sorted higher in the result set, then you add the query string to the AuxiliaryBoost field.

The final piece of this puzzle is a slight modification that you make to the actual query that goes to Solr. To get rid of all bad results, you add a filter query to remove those documents that have a match in the AuxiliaryBust field:

fq=-AuxiliaryBust:(dress shoes)
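Since the filter has to be rebuilt for every incoming search, it helps to generate it from the user's query string. The following is a minimal sketch of one way to do that, not part of Solr itself: the helper names `escape_term` and `bust_filter` are hypothetical, and the `phrase` option is an added assumption for cases where only an exact phrase tag should remove a document (with an OR default operator, the parenthesized form matches a tagged document if either term appears in the field).

```python
import re

# Characters that carry special meaning in the Lucene/Solr query syntax.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(term: str) -> str:
    """Backslash-escape query-syntax characters in a single search term."""
    return _SPECIAL_CHARS.sub(r"\\\1", term)


def bust_filter(user_query: str, field: str = "AuxiliaryBust", phrase: bool = False) -> str:
    """Build a negative filter query that drops documents tagged for this search."""
    terms = [escape_term(t) for t in user_query.split()]
    if not terms:
        raise ValueError("cannot build a filter from an empty query")
    joined = " ".join(terms)
    if phrase:
        # Stricter variant (an assumption, not from the article): match the exact phrase.
        return f'-{field}:"{joined}"'
    # Form used in the article: the terms are grouped in parentheses.
    return f"-{field}:({joined})"


print(bust_filter("dress shoes"))               # -AuxiliaryBust:(dress shoes)
print(bust_filter("dress shoes", phrase=True))  # -AuxiliaryBust:"dress shoes"
```

Escaping matters here because the text comes straight from a shopper: a search such as “t-shirt” would otherwise be read as query syntax rather than as a literal term.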

To promote those documents that really deserve to be at the top, you simply add the AuxiliaryBoost field to the set of fields that you’re searching over and apply appropriate boosting.

qf=SKU^10 ItemName^5 ItemDescription^3 Brand^4 AuxiliaryBoost^1
pf=ItemDescription^3 AuxiliaryBoost^2
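To see how these parameters fit together in one request, here is a rough sketch that assembles them in Python. The boosts and field names are the ones given above; everything else is an assumption for illustration: the helper names are hypothetical, `defType=edismax` is assumed because `qf` and `pf` belong to Solr's DisMax-family parsers and the article does not say which one is in use, and the user's input is assumed to be plain words that need no escaping.

```python
from urllib.parse import urlencode

# Field boosts taken from the article's example parameters.
QUERY_FIELDS = {"SKU": 10, "ItemName": 5, "ItemDescription": 3, "Brand": 4, "AuxiliaryBoost": 1}
PHRASE_FIELDS = {"ItemDescription": 3, "AuxiliaryBoost": 2}


def boosts(fields: dict) -> str:
    """Render {'SKU': 10, ...} in the space-separated 'SKU^10 ...' form."""
    return " ".join(f"{name}^{boost}" for name, boost in fields.items())


def build_params(user_query: str) -> dict:
    """Assemble request parameters for one search (assumes plain-word input)."""
    return {
        "defType": "edismax",  # assumption: the article does not name the parser
        "q": user_query,
        "qf": boosts(QUERY_FIELDS),   # fields searched for individual terms
        "pf": boosts(PHRASE_FIELDS),  # fields that reward whole-phrase matches
        "fq": f"-AuxiliaryBust:({user_query})",  # drop documents tagged as bad
    }


params = build_params("dress shoes")
print(params["qf"])
print(urlencode(params))  # query string to append to the search handler URL
```

Keeping the boosts in dictionaries rather than in a hand-written string makes it harder to lose the spaces between fields, and gives one obvious place to adjust a weight.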

Now, if you’re a merchandizing expert reading this, you’re probably becoming upset again at this point because you have no easy way of adding fields or of modifying the text they contain. Furthermore, if you have to adjust the boosting of particular fields, your hands are equally tied. We have recognized this issue over and over again, and as a result we are in the process of building SolrPanl, a merchandizer-facing search behavior dashboard. As a merchandizer, you will be able to use SolrPanl to create a test set of “troubled searches” to monitor and modify. If you see a search that has particularly bad results, then you will be able to adjust the boosting of various fields with a simple UI composed of sliders and selection boxes. As you modify these parameters, you can see immediately how the search results are affected. (In the past, you would have had to tell your tech team to make a modification and then check back later to see the results.) If you find that a document appears lower in a particular search result set than it should, then we will provide you with the tools to understand why that is happening. Finally, you will also be able to modify the documents directly by adding query strings to fields such as AuxiliaryBoost and AuxiliaryBust. You can even do simple things such as fixing typos!
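Until such a dashboard is available, tagging a document still comes down to an index update. As a hedged illustration only (this is not how SolrPanl works internally), the sketch below builds a JSON payload in the shape of Solr's atomic updates, where an `add` modifier appends a value to a multi-valued field. It assumes the unique key field is named `id`, that both auxiliary fields are multi-valued and that the schema supports atomic updates; the document IDs and the `tag_document` helper are hypothetical.

```python
import json


def tag_document(doc_id: str, query: str, field: str) -> dict:
    """Build an atomic-update document that appends `query` to an auxiliary field."""
    if field not in ("AuxiliaryBoost", "AuxiliaryBust"):
        raise ValueError(f"unexpected auxiliary field: {field}")
    # "add" appends a value to a multi-valued field without resending the whole document.
    return {"id": doc_id, field: {"add": query}}


updates = [
    tag_document("sundress-123", "dress shoes", "AuxiliaryBust"),   # hide from this search
    tag_document("oxford-456", "dress shoes", "AuxiliaryBoost"),    # promote in this search
]
payload = json.dumps(updates, indent=2)
print(payload)  # body that would be posted to the index's update handler
```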

If you’re interested, then please follow our ongoing development of SolrPanl here. Also, ask us about becoming a beta tester!

Published at DZone with permission of John Berryman, author and DZone MVB. (source)
