Solr-Lucene Zone is brought to you in partnership with:

Yonik Seeley is the creator of Apache Solr and the Chief Open Source Architect and Co-Founder at Lucid Imagination, a company dedicated to development and support of Lucene/Solr. He's also an Apache Lucene/Solr PMC member and committer. Yonik has posted 9 posts at DZone. You can read more from them at their website. View Full User Profile

Advanced Filter Caching in Solr

02.13.2012
| 5627 views |
  • submit to reddit

Advanced Filter Caching is a relatively new feature in Solr, available in version 3.4 and above. It allows precise control over how Solr handles filter queries in order to maximize performance, including the ability to specify if a filter is cached, the order filters are evaluated, and post filtering.

Filter Queries in Solr

Adding a filter expressed as a query to a Solr request is a snap… simply add an additional fq parameter for each filter query.

http://localhost:8983/solr/select?
   q=cars
   &fq=color:red
   &fq=model:Corvette
   &fq=year:[2005 TO *]

By default, Solr resolves all of the filters *before* the main query. Each filter query is looked up individually in Solr’s filterCache (which is pretty advanced itself, supporting concurrent lookups, different eviction policies such as LRU or LFU, and auto-warming). Caching each filter query separately accelerates Solr’s query throughput by greatly improving cache hit rates since many types of filters tend to be reused across different requests.

To Cache or not to Cache

The new advanced filter control API adds the ability to *not* cache a filter. Some filters may see almost no reuse across different requests, and not attempting to cache them can lead to a smaller, more effective filterCache with a higher hit rate.

To tell Solr not to cache a filter, we use the same powerful local params DSL that adds metadata to query parameters and is used to specify different types of query syntaxes and query parsers. For a normal query that does not have any localParam metadata, simply prepend a local param of cache=false. For example:

&fq={!cache=false}year:[2005 TO *]

To add cache=false to a filter query that already had localParams, simply add it right in with the rest of the params. For example, if we want to use Solr’s native spatial abilities to restrict our matches to locations within 50 km of Stanford, our filter query would look like:

&fq={!geofilt sfield=location pt=37.42,-122.17 d=50} 

It’s easy to modify this filter to tell Solr not to cache it by adding cache=false in with the rest of the local parameters:

&fq={!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false} 

Leapfrog anyone?

When a filter isn’t generated up front and cached, it’s executed in parallel with the main query. First, the filter is asked about the first document id that it matches. The query is then asked about the first document that is equal to or greater than that document. The filter is then asked about the first document that is equal to or greater than that. The filter and the query play this game of leapfrog until they land on the same document and it’s declared a match, after which the document is collected and scored.

How much is that filter?

Advanced filtering adds even more fine grained control by introducing the notion of cost. If there are multiple non-cached filters in a response, filters with a lower cost will be checked before those with a higher cost.

&fq={!cache=false cost=10}year:[2005 TO *]
&fq={!geofilt cache=false cost=20}
&pt=37.42,-122.17
&sfield=location
&d=50

In the example above, the filter based on year has a lower cost and will thus always be checked before the spatial filter.

As an aside, notice how spatial queries will use global spatial request parameters if they are not specified locally. This can make it even easier to construct requests containing spatial functions.

Expensive Filters

Some filters are slow enough that you don’t even want to run them in parallel with the query and other filters, even if they are consulted last, since asking them “what is the next doc you match on or after this given doc” is so expensive. For these types of filters, you really want to only ask them “do you match this doc” only after the query and all other filters have been consulted. Solr has special support for this called “post filtering“.

Post filtering is triggered by filters that have a cost>=100 and have explicit support for it. If there are multiple post filters in a single request, they will be ordered by cost.

The frange qparser has post filter support and allows powerful queries specifying ranges over arbitrarily complex function queries.

For example, if we wanted to take the log of popularity, divide it by the square root of the distance, and filter out documents with a result less than 5, we could run this as a post filter using frange:

&fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))

Post filtering support for the spatial filter queries bbox and geofilt has just been added to Solr 4.0 too. To execute our previous un-cached spatial filter as a post filter, simply modify it’s cost to be greater than 100:

&fq={!geofilt cache=false cost=150}
&pt=37.42,-122.17
&sfield=location
&d=50

Custom Filters

If you have expensive custom logic you’d like to add as a post filter (say per-document custom security ACLs), you can implement your own QParserPlugin that returns Query objects that implement Solr’s PostFilter interface. You can set the default cost or hardcode a cost higher than 100 if you want to only support post filtering. Then, you can use your custom parser as you would any other builtin query type via fq={!myqueryparser} and Solr will handle the rest!

Try it out!

In conclusion, hopefully this gives more insight into just one of many factors working under the hood to make Solr so fast.
To try out the latest functionality, you can always get a nightly build of trunk. Feedback is always appreciated!

Source:  http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/

Published at DZone with permission of its author, Yonik Seeley.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Goel Yatendra replied on Thu, 2012/03/15 - 4:08pm

Great post.

I have been using filter queries to avoid calculating scores and performance improved but as I have lots of filters, I was running into OOM error frequently…I used the local parameters to avoid unnecessary caching and the cost value to use the post filtering and it works just fine for my data.

Thank you.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.