Rafał Andrzejewski is a team leader and software developer who has taken part in many Solr and Lucene projects. He recently joined the solr.pl site as a blogger, where he shares his knowledge about search-engine-oriented topics. Rafał is a DZone MVB and is not an employee of DZone and has posted 8 posts at DZone.

“Car sale application” – solr.ReversedWildcardFilter – let’s optimize wildcard queries (part 8)

10.12.2011

“Car sale application” users started to use wildcard queries more and more often, which forced us to think about optimizing them. solr.ReversedWildcardFilter comes to the rescue.

solr.ReversedWildcardFilter

The solr.ReversedWildcardFilter filter produces additional tokens, which are reversed copies of the original tokens, and indexes them so that leading wildcard queries run faster. The filter supports the following init arguments:

  • withOriginal – if true, then produce both original and reversed tokens at the same positions. If false, then produce only reversed tokens.
  • maxPosAsterisk – maximum position (1-based) of the asterisk wildcard (‘*’) that triggers the reversal of query term. Asterisk that occurs at positions higher than this value will not cause the reversal of query term.
  • maxPosQuestion – maximum position (1-based) of the question mark wildcard (‘?’) that triggers the reversal of query term.
  • maxFractionAsterisk – additional parameter that triggers the reversal if asterisk (‘*’) position is less than this fraction of the query token length.
  • minTrailing – minimum number of trailing characters in query token after the last wildcard character. For good performance this should be set to a value larger than 1.
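The arguments above can be read as a decision rule applied to each wildcard query term. The following is a minimal Python sketch of that rule as described in the list, not Solr's actual source code; the function name `should_reverse` is hypothetical, and the default argument values are the filter's defaults as far as I know them.

```python
def should_reverse(term, max_pos_asterisk=2, max_pos_question=1,
                   max_fraction_asterisk=0.0, min_trailing=2):
    """Sketch: decide whether a wildcard term should use the reversed tokens."""
    first_q = term.find("?")
    first_a = term.find("*")
    if first_q == -1 and first_a == -1:
        return False  # not a wildcard term at all

    # Characters that follow the last wildcard in the term.
    last_wildcard = max(term.rfind("?"), term.rfind("*"))
    trailing = len(term) - last_wildcard - 1
    if trailing < min_trailing:
        return False  # too few trailing characters to benefit from reversal

    # str.find() is 0-based, while the documented positions are 1-based.
    if first_q != -1 and first_q + 1 <= max_pos_question:
        return True
    if first_a != -1 and first_a + 1 <= max_pos_asterisk:
        return True

    # Optional rule: asterisk within the leading fraction of the term.
    if (max_fraction_asterisk > 0.0 and first_a != -1
            and first_a < len(term) * max_fraction_asterisk):
        return True
    return False
```

With the defaults, `should_reverse("*dx")` is true, while `should_reverse("lan*")` and `should_reverse("r?x")` are false, because too few characters follow the last wildcard.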


schema.xml changes

A new filter is added to the “text” field type:

<fieldType name="text" class="solr.TextField"
	positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.ReversedWildcardFilterFactory" />
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" />
		<filter class="solr.LowerCaseFilterFactory" />
	</analyzer>
</fieldType>

The solr.ReversedWildcardFilterFactory filter is added only to the index analyzer. We do not define any arguments in the filter definition, because we want to use the default configuration, which is:

  • withOriginal – “true”, so the original tokens are produced as well
  • maxPosAsterisk – 2
  • maxPosQuestion – 1
  • maxFractionAsterisk – 0.0f (disabled)
  • minTrailing – 2


Sample data

Let’s index some sample data:

<add>
  <doc>
    <field name="id">1</field>
    <field name="make">Lancia</field>
    <field name="model">Delta</field>
    ...
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="make">Land Rover</field>
    <field name="model">Defender</field>
    ...
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="make">Acura</field>
    <field name="model">MDX</field>
    ...
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="make">Acura</field>
    <field name="model">RDX</field>
    ...
  </doc>
  <doc>
    <field name="id">5</field>
    <field name="make">Acura</field>
    <field name="model">RSX</field>
    ...
  </doc>
</add>

Let’s create queries

Let me remind you that the default search field is the “content” field, which contains, among others, the “make” and “model” fields. To analyse the query results and the behaviour of the solr.ReversedWildcardFilter filter, we will set the “stored” argument of the “content” field to “true”. We will also add the debugQuery query argument, which will allow us to find out which tokens (original or reversed) are used in query processing.
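For readers who want to send the queries from a script, here is a minimal Python sketch that builds such a request URL. The base URL `http://localhost:8983/solr/select` and the helper name `build_debug_query` are assumptions for illustration only; adjust them to your own Solr installation.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; not taken from the article.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_debug_query(q, fields=("id", "content")):
    """Build a select URL that returns the given fields plus debug output."""
    params = {
        "q": q,                   # the wildcard query, e.g. "lan*"
        "fl": ",".join(fields),   # fields to return
        "debugQuery": "on",       # ask Solr to include the parsed query
    }
    # urlencode percent-encodes '*', '?' and ',' so the URL stays valid.
    return SOLR_SELECT_URL + "?" + urlencode(params)

print(build_debug_query("*dx"))
```

The wildcard characters are percent-encoded in the resulting URL; Solr decodes them back before parsing the query.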

1.   ?q=lan*&fl=id,content&debugQuery=on
<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Lancia</str>
      <str>Delta</str>
      <str>2002</str>
    </arr>
    <str name="id">1</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Land Rover</str>
      <str>Defender</str>
      <str>2002</str>
    </arr>
    <str name="id">2</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">lan*</str>
  <str name="querystring">lan*</str>
  <str name="parsedquery">content:lan*</str>
  <str name="parsedquery_toString">content:lan*</str>
  ...
</lst>

The asterisk wildcard (‘*’) is at the end of the query (position 4), which is beyond maxPosAsterisk (2), and no characters follow it, so the original tokens were used:

<str name="parsedquery">content:lan*</str>

2.   ?q=*dx&fl=id,content&debugQuery=on

<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>MDX</str>
      <str>2002</str>
    </arr>
    <str name="id">3</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RDX</str>
      <str>2003</str>
    </arr>
    <str name="id">4</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">*dx</str>
  <str name="querystring">*dx</str>
  <str name="parsedquery">content:#1;xd*</str>
  <str name="parsedquery_toString">content:#1;xd*</str>
  ...
</lst>

The asterisk wildcard (‘*’) is at the beginning of the query (position 1, within maxPosAsterisk) and two characters follow the last wildcard, which satisfies minTrailing. That’s why the reversed tokens were used:

<str name="parsedquery">content:#1;xd*</str>

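As we can see, the reversed tokens carry a special prefix (shown as #1; in the debug output, which appears to be the U+0001 control character) in order to avoid collisions and false matches with the original tokens.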
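To make the role of the prefix concrete, here is a small Python sketch of the idea, assuming the marker is the U+0001 control character suggested by the debug output. The helper names `index_tokens` and `reverse_query_term` are hypothetical, and the wildcard matching uses Python's `fnmatch` only as a stand-in for the index lookup.

```python
from fnmatch import fnmatchcase

MARKER = "\u0001"  # assumed marker character prepended to reversed tokens

def index_tokens(token, with_original=True):
    """Tokens stored for one input token at index time (sketch)."""
    reversed_token = MARKER + token[::-1]
    return [token, reversed_token] if with_original else [reversed_token]

def reverse_query_term(term):
    """Turn a leading-wildcard term into a trailing-wildcard one."""
    return MARKER + term[::-1]

indexed = index_tokens("mdx") + index_tokens("rsx")
query = reverse_query_term("*dx")  # marker + "xd*"

# Only the reversed form of "mdx" matches; the marker keeps the reversed
# query from ever matching an ordinary, non-reversed token.
matches = [t for t in indexed if fnmatchcase(t, query)]
print([m.lstrip(MARKER)[::-1] for m in matches])
```

Because the reversed query now has its wildcard at the end, it can be answered as a cheap prefix lookup rather than a scan over all terms.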

3.  ?q=r?x&fl=id,content&debugQuery=on
<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RDX</str>
      <str>2003</str>
    </arr>
    <str name="id">4</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RSX</str>
      <str>2006</str>
    </arr>
    <str name="id">5</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">r?x</str>
  <str name="querystring">r?x</str>
  <str name="parsedquery">content:r?x</str>
  <str name="parsedquery_toString">content:r?x</str>
  ...
</lst>

The question mark wildcard (‘?’) is at position 2, which is beyond maxPosQuestion (1), and only one character follows the wildcard, fewer than minTrailing (2). The original tokens were used:

<str name="parsedquery">content:r?x</str>

The end

Thanks to the solr.ReversedWildcardFilter filter, we have successfully optimized wildcard queries. “Car sale application” users can now effectively use them :)

 

Published at DZone with permission of Rafał Andrzejewski, author and DZone MVB. (source)
