Grant Ingersoll is a committer on the Apache Lucene and Apache Solr projects, as well as the current Lucene PMC chair. He is also a founding team member of Lucid Imagination.

Search (Business) Rules!

06.25.2012

Introduction

One of the most frequent requests we get from customers is for easier ways to express business needs as part of their search infrastructure. For instance, imagine you’re a large eCommerce site and you want to add facets for the 4 “C’s” of diamond quality anytime someone searches for the word “diamond”, but when a query for “headphones” comes in you want to add facets for plug size, style, manufacturer, etc.

With the LucidWorks Search Platform, this used to require your search team to implement an often extensive business layer, one flexible enough to let you alter these kinds of things without releasing new code or bouncing the server. For some context, LucidWorks has had some minimal capabilities for expressing changes to relevance based on business needs. For instance, editorial boosting (QueryElevationComponent in Solr parlance) allows one to boost specific results to the top of the result set for given queries or to exclude specific documents. Alternatively, Solr’s function capabilities (function queries and sort by function) allow one to express fairly complicated expressions for boosting content based on the values of specific fields. Unfortunately, unless your search team exposed these things through their admin interface, using these capabilities still required a programmer, or at least someone knowledgeable about Solr’s configuration.

In LucidWorks 2.1, we set out to remedy these limitations by putting in place a solution that lets users better express business needs in a dynamic, fast-paced environment. However, unlike most search engines, which take a “not invented here” approach to functionality and spend a lot of effort on building something from scratch (guess who pays for that?), we decided it would be far more useful to harness the capabilities of the existing, very large and quite capable business rules community and their proven mechanisms for capturing dynamic business requirements. To that end, we put in place two pieces of functionality in LucidWorks 2.1:

  1. A framework for integrating third-party business rules solutions into the search context, at both indexing time and search time. Thus, if your company already has a rules engine like IBM’s JRules or Fair Isaac’s Blaze Advisor, you can take advantage of that investment by hooking it into LucidWorks through our framework. Note, you will have to write some code to do this, but it is fairly minimal and we’re happy to help.
  2. A working, fully integrated implementation leveraging Red Hat’s Apache-licensed Drools rules engine that allows users to apply business rules across a broad range of LucidWorks functionality, which we’ll detail below. Despite the odd-sounding name (which is a hallmark of open source), Drools is a robust, well-supported, well-documented open source project in wide use.

Quick Drools Primer

After evaluating a number of rules engines, both proprietary and open source, we decided on Drools for a number of reasons:

  1. Easy to use rules language with an associated web-based editor
  2. Implements the Rete algorithm, which is pretty much the standard for this kind of stuff.
  3. 100% Java, making it easy to install and easy to integrate
  4. Apache licensed
  5. Cost (i.e. free as in beer)

There are already a lot of tutorials, books and the like on Drools available, so I won’t go into too many details about Drools other than to provide a quick overview plus a few resources. Drools works by having applications inject facts into what is known as the working memory and then evaluating which user-written rules should be fired given the facts in the working memory. Rules are essentially if-then clauses which allow rule writers to express what facts must be true (the “if” clause, or “when” clause in Drools lingo) and then what should happen when those facts are true (the “then” clause). A fact in Drools is essentially any Java object that the application wishes to inject. For LucidWorks, facts are things like the input query or the document to be indexed and other objects used to process requests in Solr. As an example of writing rules in Drools that operate on facts, here is “Hello LucidWorks”:

    rule "HelloLucidWorks"
    no-loop
    when
        $rb : ResponseBuilder();
    then
        $rb.rsp.add("hello", "lucidworks");
    end

All this does is check to see if there is an object called ResponseBuilder in the working memory (those familiar with Solr will recognize this as one of the key objects in processing requests in a SearchComponent) and then add a key-value pair to that ResponseBuilder. Naturally, like all “Hello World” examples, it isn’t all that interesting, other than that one quickly notices that Drools looks a lot like Java and that this looks like programming, both of which are true on the surface. Drools does look like Java, but it also supports users writing rules via either the Guvnor UI (which does not feel as much like programming or Java) or a Domain Specific Language (DSL) which can be tailored to your particular domain. Rather than turn this into a lengthy Drools tutorial, let’s move along, but not without first referring you to the following resources, after which we’ll try this out in action:

  • www.drools.org - The main place to get started with Drools and find all the latest documentation
  • Drools Books on Amazon. Note there are several good ones, but make sure to get one covering version 5.
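
If the vocabulary of facts, working memory and when/then clauses from the primer feels abstract, a toy sketch in plain Java may help. To be clear, this is not the Drools API and it does not implement Rete; the ToyRules class, its Rule type and the fireAll method are all hypothetical names made up for illustration, and a plain Map stands in for Solr’s ResponseBuilder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy illustration only: NOT the Drools API, and no Rete network involved.
class ToyRules {

    // A rule pairs a "when" condition with a "then" action over a single fact.
    static final class Rule {
        final String name;
        final Predicate<Object> when;
        final Consumer<Object> then;

        Rule(String name, Predicate<Object> when, Consumer<Object> then) {
            this.name = name;
            this.when = when;
            this.then = then;
        }
    }

    // Tests every rule against every fact in working memory, exactly once,
    // and returns the names of the rules that fired.
    static List<String> fireAll(List<Object> workingMemory, List<Rule> rules) {
        List<String> fired = new ArrayList<>();
        for (Rule rule : rules) {
            for (Object fact : workingMemory) {
                if (rule.when.test(fact)) {
                    rule.then.accept(fact);
                    fired.add(rule.name);
                }
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // The application injects a fact; here a Map stands in for a response object.
        Map<String, String> response = new HashMap<>();
        List<Object> workingMemory = new ArrayList<>();
        workingMemory.add(response);

        // Rough analogue of the "HelloLucidWorks" rule: if such a fact exists, add a pair to it.
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(
                "HelloLucidWorks",
                fact -> fact instanceof Map,
                fact -> {
                    @SuppressWarnings("unchecked")
                    Map<String, String> rsp = (Map<String, String>) fact;
                    rsp.put("hello", "lucidworks");
                }));

        System.out.println(fireAll(workingMemory, rules) + " " + response);
    }
}
```

A real engine differs in important ways: it matches patterns across many facts efficiently and re-evaluates rules when a fact is modified. The sketch makes a single pass over the facts and only shows the shape of the idea.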

LucidWorks + Drools

Instead of going into details on how this is all implemented, let’s try it out by working through a use case whereby we want to add specific terms into a query when we see a certain term. For background, we’ll need to get LucidWorks set up and get some documents into it. To do that, follow these steps:

  1. Download LucidWorks from our downloads page and install it per the install instructions and have it start LucidWorks. I’ll refer to the install location that you choose from here on out as $LW_HOME.
  2. Download the sample Apache email documents from http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout and unpack it (tar -xf ibm.tar.gz). I’ll refer to the location of these files as $CONTENT.
  3. Log in to the LucidWorks admin at http://localhost:8989/ and create a new collection named ASFArchives.
  4. Create a new Filesystem Data source and point it at $CONTENT (i.e. the full path of the directory you unpacked ibm.tar.gz in). I named my data source “Small” and used the defaults for the rest of the options. For more information, see the LucidWorks user guide.
  5. Once you’ve saved the Data Source, kick off the crawl. You can commit the results anytime by browsing to http://localhost:8989/ and hitting the commit button (it will also commit on its own, but if you’re impatient like me, you can force it).
  6. After it’s crawled a while, try a search such as “cocoon” (http://localhost:8989/collections/ASFArchives/search?q=cocoon)
  7. We’ll come back to this data later. Once the crawl is done, you should see roughly 370,000 documents in the collection.

That’s all for setup for now, so let’s switch gears and focus on writing some rules. In LucidWorks 2.1, our integration requires editing Drools rules files using a text editor. You could also likely use Drools’ Guvnor UI and save the files to the appropriate place, but I haven’t personally tested it. Our default setup comes with a set of rules files that are hooked into various places inside of LucidWorks. All of these rules files are located in $LW_HOME/conf/solr/cores/asfarchives_1/conf/rules (for each core you have) and have a file suffix of .drl. There should be 4 files in the rules directory, named and described below:

  • defaultDocs.drl: Contains rules that are applied during indexing as part of an update processor in Solr.
  • defaultFirst.drl: Contains rules that are applied during search and faceting requests before other Solr SearchComponents are fired. In other words, it’s the best place to work on the raw request before any results are calculated.
  • defaultLast.drl: Contains rules that are applied after other SearchComponents are fired. In other words, it’s the best place to examine the results and make modifications.
  • defaultLanding.drl: Contains rules that can be used to short circuit search requests altogether.

Note, you can, of course, change the names of these via your configuration, but for now, there is no need.

To get started, let’s open up defaultFirst.drl in an editor. You should see:

    # This file contains Lucid’s default rules, as specified in the default solrconfig.xml
    # The default configuration uses this rules file in three places:
    # 1. The Landing Page component, which can be used to short circuit results and just return a landing page
    # 2. The RulesComponent configured to run before all other SearchComponents (there is also
    # one configured to run after all other components, except debug.
    # 3. The RulesDocTransformer, which can be used to alter the fields on a document before it
    # is written out.
    #
    # Rule writers may rely on, when using the RulesComponent, the LandingPageComponent or the RulesDocTransformerFactory, the fact that
    # the name of the "handler" (specified in the configuration) will be available as part of the Request context (request.getContext().get("rulesHandler")) along
    # with the phase the component is in (prepare or process - getContext().get("rulesPhase")) such that rules can be written that target a specific
    # handler and/or a specific phase.
    #
    #
    package rules;

    #Some common imports
    import org.apache.solr.handler.component.ResponseBuilder;
    import function com.lucid.rules.drools.DroolsHelper.*;
    import org.apache.solr.common.SolrDocument;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.params.ModifiableSolrParams;

    import org.slf4j.Logger;
    global org.slf4j.Logger logger;

With that out of the way, let’s write a rule. In this case, I want to force a query of “cocoon” (since we are searching Apache email archives) to only return results that contain both “cocoon” and “compiled” (just for grins; it really isn’t meaningful in a real situation). To do this, we need to add a rule to the defaultFirst.drl file. For reference, run the query now (http://localhost:8888/solr/ASFArchives/lucid?q=cocoon&start=0&rows=10&wt=json&indent=true&rules=false&role=DEFAULT) and examine some of the results. As the name implies, this rules file gets fired first, before things like query parsing take place, so we will be operating on the query String as opposed to the parsed Query object familiar to Lucene users (we can also operate on the parsed Query, if we want, using other rules files). Here’s a sample of what the rule might look like:

    rule "cocoon"
    no-loop
    when
        $rb: ResponseBuilder($qStr : req.params.get("q").contains("cocoon"));
    then
        addToResponse($rb, "origQuery", $qStr);
        modRequest($rb, "q", "cocoon AND compiled");
    end

The rule is quite simple. First we tell Drools some things about the rule (name, no-loop) and then the “if” clause. In this case, we want to see if the query parameter to Solr (“q”) contains the word “cocoon”. If it does, then the rule will fire and do two things:

  1. Write the original query to the response as “origQuery” so that our application can know the query was changed
  2. Modify the request by setting the “q” parameter to the new query

The addToResponse and modRequest methods are part of the DroolsHelper import and are provided by LucidWorks. The DroolsHelper class contains a variety of convenience methods for manipulating facts in your rules. These are documented in the LucidWorks documentation. One thing to note: since you are modifying a fact in the system, you have to be careful not to put Drools into an infinite loop. When a fact changes, Drools reevaluates the rules, and because the modified query still contains “cocoon”, this rule would match and fire again. The “no-loop” rule modifier prevents this from happening. Drools also has some other mechanisms for controlling this behavior in more complicated situations, so I encourage you to read the documentation to learn more.

Whew. We’ve got a rule and a little bit of understanding. Now let’s run it. Rules file reloads are triggered by core reloads, so we need to force a core reload. Unfortunately, this isn’t obvious, but it can be done in a couple of ways:

  1. Restart LWE, which is not great for a running system, but does work.
  2. Alter an indexing setting such as the soft commit time.
  3. Hit the Solr CoreAdmin handler. See http://wiki.apache.org/solr/CoreAdmin
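
For the CoreAdmin route, the request in question is Solr’s standard RELOAD action. The sketch below only builds the URL, which you can then paste into a browser or pass to curl. It assumes the CoreAdmin handler is exposed at Solr’s default /admin/cores path under the same http://localhost:8888/solr base used for queries in this article, and that the core is named asfarchives_1 as in the rules directory path above; check your own installation, since either may differ. CoreReloadUrl and reloadUri are hypothetical names.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a Solr CoreAdmin RELOAD URL. Base URL and core name are assumptions; adjust to your install.
class CoreReloadUrl {

    static URI reloadUri(String solrBase, String coreName) {
        // Encode the core name in case it contains characters that are unsafe in a query string.
        String core = URLEncoder.encode(coreName, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/admin/cores?action=RELOAD&core=" + core);
    }

    public static void main(String[] args) {
        System.out.println(reloadUri("http://localhost:8888/solr", "asfarchives_1"));
    }
}
```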

Now, try the query from above again, but this time turn on the rules component by passing in rules=true instead of false: http://localhost:8888/solr/ASFArchives/lucid?q=cocoon&start=0&rows=10&wt=json&indent=true&rules=true&role=DEFAULT. In looking at your results, you should see a few things:

  1. In the response, the original query:

    "origQuery":"cocoon"

  2. In the responseHeader, the “q” parameter is changed:

    "q":"cocoon AND compiled"

  3. Far fewer results, all of which contain both cocoon and compiled (or variations of compile, since we are stemming)

Voila! Quick and easy and no recompiling, restarting or hassle. Our example here is pretty easy, but hopefully you can start to see the power of this capability as it significantly reduces the cost of deploying business rules for search.

Now, you may be wondering, how’d he know the “q” parameter was in the working memory? That’s because all incoming SolrParams are placed in the working memory when dealing with search requests (more on how that works later, via something called the FactCollector). So are the following, many of which are provided for convenience, since they can also be accessed via the ResponseBuilder object:

  • ResponseBuilder object
  • The IndexSchema object
  • The SolrRequest object
  • The Context (this is a Map in Solr)
  • The SolrResponse (which is how we were able to write the origQuery to it)
  • All filter queries (fq parameters)
  • All facet counts (assuming facets are being calculated)
  • The Sorting specification (SortSpec)
  • The grouping specification (GroupingSpecification)
  • The results (DocListAndSet)

On the indexing side, the FactCollector has the input command (AddUpdateCommand), the input document and the IndexSchema. The rules transformer has the document to be output, the internal Lucene document id and the IndexSchema. Each of these things can be useful when it comes to writing rules. For now, let’s finish up by looking at some other use cases and ideas around leveraging rules.

Ruling the Land

If you were paying attention, you likely noticed that we ship with a number of “default” rules files. These allow you to insert rules at different parts of requests for both indexing and searching. For instance, we have customers who overlay their taxonomy onto documents during crawling/indexing by looking at the URL that was found and then looking up the appropriate category and adding it as a field on the document based on rules. In other cases, documents are altered dynamically as they come out of LucidWorks by adding or modifying fields using Solr’s document transformer capabilities (If you look in the solrconfig.xml, you’ll see the default configuration is to hook up the RulesTransformerFactory to the “first” rules engine, which uses defaultFirst.drl rules file. See the documentation for more info.) As another handy trick, keep in mind that all passed in parameters are exposed as facts to Drools, while being ignored by the rest of LucidWorks. So, for instance, you can pass in things like user ids, user locations or other business information provided by your application and then leverage them in your rules.

Additionally, there are a few other things you can do to customize your installation to suit your needs. For instance, since most of this is implemented as a SearchComponent, you can place it anywhere in the SearchComponent chain that you want (although first and last are likely the most useful). You can also provide your own FactCollector. A FactCollector is responsible for injecting, you guessed it, facts into the working memory. As an example, LucidWorks actually ships with two FactCollector implementations. The first is the base FactCollector that we’ve been using so far and which is the default. The second is the StatsFactCollector, which extends FactCollector and brings system statistics into the working memory based on LucidWorks JMX statistics. Why is this useful? Say you operate a complex application that sees significant spikes in traffic. Now, in most situations, you’d simply want to add hardware, but that isn’t always possible, and so you may want to put some rules in place that only fire when certain failsafes, such as some query load threshold, are triggered. What do these rules do? They simplify requests in order to reduce resource usage. For example, you could:

  • Remove optional facets
  • Simplify queries that are known to be really expensive (such as wildcards.)
  • Turn off spellchecking, highlighting, More Like This or LucidWorks’ pseudo relevance feedback
  • Only sort by score

The goal here is to dynamically reduce the load on your servers while still serving up most of your application’s functionality. It’s probably wise to let your users know this, but they may not even notice and it is a whole lot better than failing.
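
As a rough sketch of that load-shedding logic, here is the decision written as plain Java rather than as a Drools rule, since the fact names the StatsFactCollector exposes aren’t covered here. The LoadShedding class, the simplify method and the queries-per-second threshold are hypothetical; the parameter names (facet, spellcheck, hl, mlt, sort) are standard Solr request parameters. In a real deployment the same idea would more likely live in a rule that calls modRequest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of load shedding on request parameters; not LucidWorks code.
class LoadShedding {

    // Returns a copy of the request parameters, simplified only when load exceeds the threshold.
    static Map<String, String> simplify(Map<String, String> params,
                                        double queriesPerSecond,
                                        double threshold) {
        Map<String, String> out = new LinkedHashMap<>(params);
        if (queriesPerSecond <= threshold) {
            return out; // Normal load: leave the request untouched.
        }
        out.put("facet", "false");      // Drop optional facets.
        out.put("spellcheck", "false"); // Turn off spellchecking.
        out.put("hl", "false");         // Turn off highlighting.
        out.put("mlt", "false");        // Turn off More Like This.
        out.remove("sort");             // With no sort given, Solr sorts by score.
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "cocoon");
        params.put("facet", "true");
        params.put("sort", "date desc");
        System.out.println(simplify(params, 500.0, 100.0));
    }
}
```

Rewriting expensive queries such as wildcards is left out of the sketch, since what counts as a safe simplification depends on the application.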

If you implemented your own FactCollector, you might inject facts from Solr in a different way from what we’ve chosen to do or facts from other systems altogether. To load yours, you just need to pass in the class name as part of the solrconfig.xml. If you search in that file for FactCollector, you’ll see an example.

At this point, I think I’ve covered most everything there is to get started as well as given you some food for thought on where to go next. What will you do next to rule search?

Published at DZone with permission of its author, Grant Ingersoll.
