Rafał Kuć is a team leader and software developer. He currently works as a software architect and Solr and Lucene specialist. He focuses mainly on Java, but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB and is not an employee of DZone.

The New Spell Checker in Solr 4.0

04.30.2012

One of the new features that will be introduced in Solr 4.0 is a new SpellChecker implementation that doesn’t require its own index. I decided to take a quick look at it and share my thoughts.

What We Have Today

As of today (Solr 3.6), we can use the following SpellChecker implementations:

  • org.apache.solr.spelling.IndexBasedSpellChecker
  • org.apache.solr.spelling.FileBasedSpellChecker

With the upcoming Solr 4.0, we will get a new implementation:

  • org.apache.solr.spelling.DirectSolrSpellChecker


Current Problems

In most of the cases I worked with, the main problem with IndexBasedSpellChecker was the need to rebuild its index. In some cases the rebuild took a long time, so it wasn’t possible to rebuild that index after every commit, which for some users was a big issue. Of course, that wasn’t a problem with FileBasedSpellChecker, but in my case it was only used as a support mechanism for IndexBasedSpellChecker.
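
For comparison, an IndexBasedSpellChecker definition in Solr 3 might look roughly like the sketch below. The field name, field type, and directory are assumptions chosen for illustration. The spellcheckIndexDir and buildOnCommit entries are the parts that tie this implementation to a separate index that has to be rebuilt.

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textTitle</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- Separate spelling index kept on disk (path is an assumption) -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- Rebuild the spelling index on every commit; this is the step
         that becomes expensive when the rebuild takes a long time -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```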

Configuration

The DirectSolrSpellChecker configuration is similar to the one you are used to in Solr 3, with some additional parameters. Here is a sample configuration:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textTitle</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.7</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>

And the meaning of each parameter:

  • queryAnalyzerFieldType – name of the field type whose analyzer will be used to analyze the SpellChecker query.
  • field – field whose contents will be used to build SpellChecker results.
  • classname – SpellChecker implementation class.
  • distanceMeasure – algorithm used to calculate the distance between terms; in our case we use the default, internal one (Levenshtein).
  • accuracy – minimum accuracy a suggestion must reach to be counted as a proper one.
  • maxEdits – maximum number of edits allowed during term enumeration. This property can be set to 1 or 2.
  • minPrefix – minimum length of the common prefix required during term enumeration.
  • maxInspections – maximum number of candidate matches inspected for each suggestion.
  • minQueryLength – minimum length of a query term for it to be considered for correction.
  • maxQueryFrequency – maximum percentage of documents in which a query word can appear for the word to still be considered one to correct (a value of 0.01 means 1%).
  • thresholdTokenFrequency – minimum percentage of documents in which a suggestion has to appear in order to be considered a proper one (a value of .01 means 1%).
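
To illustrate how these parameters interact, here is a sketch of a second, more conservative spellchecker definition that could sit next to the default one inside the same searchComponent. The name strict and the chosen values are assumptions for illustration, not recommendations. Only the parameters described above are used.

```xml
<lst name="spellchecker">
  <!-- Hypothetical name; select it with spellcheck.dictionary=strict -->
  <str name="name">strict</str>
  <str name="field">title</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <!-- Require closer matches than the 0.7 used in the sample above -->
  <float name="accuracy">0.8</float>
  <!-- Allow only a single edit between the query term and a suggestion -->
  <int name="maxEdits">1</int>
  <!-- Suggestions must share at least the first two characters -->
  <int name="minPrefix">2</int>
  <int name="maxInspections">5</int>
  <int name="minQueryLength">4</int>
  <float name="maxQueryFrequency">0.01</float>
  <float name="thresholdTokenFrequency">.01</float>
</lst>
```

A stricter definition like this should return fewer suggestions, which may be preferable when wrong corrections are more harmful than missing ones.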


The configuration attributes above show that DirectSolrSpellChecker gives us a high degree of control over its behavior.

Usage

When it comes to usage, DirectSolrSpellChecker is no different from the other SpellChecker implementations. As with the previous implementations, you can configure Solr to add SpellChecker results to the results of every query, or configure a new handler and decide when to query it. We wrote about how to use SpellChecker in the past, in the “Car sale application” example.
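
As a rough sketch of the second option, a dedicated handler could look like the fragment below. The handler name /spell and the default values are assumptions chosen for illustration. The spellcheck parameters are standard SpellCheckComponent request parameters, and the dictionary name matches the default spellchecker from the sample configuration.

```xml
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn spell checking on for every request to this handler -->
    <str name="spellcheck">true</str>
    <!-- Use the spellchecker named "default" -->
    <str name="spellcheck.dictionary">default</str>
    <!-- Return up to five suggestions per misspelled term -->
    <str name="spellcheck.count">5</str>
    <!-- Also return a rewritten query built from the best suggestions -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <!-- Run the spellcheck component after the standard search components -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Keeping spell checking in its own handler means regular queries don't pay for it, and the application can call this handler only when it actually needs suggestions.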

What Can We Expect?

According to the information in JIRA issue LUCENE-2507, DirectSolrSpellChecker will not only remove the need for a separate index, but will also improve the quality of suggestions. Judging by that issue, DirectSolrSpellChecker gives better results than the previous implementations, although it is slightly slower. I don’t think that will be a problem as long as you don’t use SpellChecker with every query.



Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)
