Rafał Kuć is a team leader and software developer, currently working as a software architect and Solr and Lucene specialist. He focuses mainly on Java, but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB, is not an employee of DZone, and has posted 75 posts at DZone.

Developing Your Own Solr Filter

05.17.2012

Sometimes the out-of-the-box functionality of Lucene and Solr is not enough. When that happens, we need to extend what Lucene and Solr give us and create our own plugin. In today's post I'll try to show you how to develop a custom filter and use it in Solr.

Assumptions

Let's assume that we need a filter that reverses every word in a given field. So, if the input is “solr.pl”, the output would be “lp.rlos”. It's not the hardest example, but it is enough for the purpose of this entry. One more thing: I decided to omit describing how to set up your IDE, how to compile your code, how to build the jar, and so on. We will focus only on the code.
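
To make the expected behaviour concrete, here is a minimal plain-Java sketch (no Lucene or Solr involved) that reverses each whitespace-separated word of a string. The class name ReverseWordsDemo and the method reverseWords are made up for this illustration; the real filter works on token stream attributes, not on strings.

```java
public class ReverseWordsDemo {
  // Hypothetical helper: reverses every whitespace-separated word,
  // keeping the order of the words themselves.
  static String reverseWords(String text) {
    StringBuilder out = new StringBuilder();
    for (String word : text.trim().split("\\s+")) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(new StringBuilder(word).reverse());
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseWords("solr.pl"));
    System.out.println(reverseWords("apache lucene"));
  }
}
```

Note that only the characters inside each word are reversed; the words stay in their original positions.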

Additional Information

The code presented in this post was created using Solr 3.6 libraries, although you shouldn't have many problems compiling it against Solr 4 binaries. Keep in mind, though, that some slight modifications may be needed (in case something changes before the Solr 4.0 release).

What We Need

In order for Solr to be able to use our filter, we need two classes. The first is the actual filter implementation, which handles the logic. The second is the filter factory, which is responsible for creating instances of the filter. Let's get it done then.

Filter

In order to implement our filter, we will extend the TokenFilter class from the org.apache.lucene.analysis package and override the incrementToken method. This method returns a boolean value: it should return true if a token is available for further processing in the token stream, and false if the stream has no more tokens to analyze. The implementation should look like the one below:

package pl.solr.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class ReverseFilter extends TokenFilter {
  private CharTermAttribute charTermAttr;

  protected ReverseFilter(TokenStream ts) {
    super(ts);
    this.charTermAttr = addAttribute(CharTermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }

    int length = charTermAttr.length();
    char[] buffer = charTermAttr.buffer();
    char[] newBuffer = new char[length];
    for (int i = 0; i < length; i++) {
      newBuffer[i] = buffer[length - 1 - i];
    }
    charTermAttr.setEmpty();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
  }
}

Description of the Above Implementation

A few words about some of the lines of code in the above implementation:

  • Line 9 – a class which extends TokenFilter and will be used as a filter should be marked as final (Lucene requirement).
  • Line 10 – the token stream attribute which allows us to get and modify the text contents of the term. If we wanted, our filter could use more than a single stream attribute, for example one for getting and changing the position in the token stream, or one for the payload. The list of Attribute interface implementations can be found in the Lucene API (e.g. http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/util/Attribute.html).
  • Lines 12 – 15 – the constructor, which takes a token stream as an argument and then adds (line 14) the appropriate token stream attribute.
  • Lines 18 – 32 – the incrementToken method implementation.
  • Lines 19 – 21 – check if a token is available for processing. If not, return false.
  • Line 23 – get the length of the term whose contents we want to reverse.
  • Line 24 – get the buffer holding the word we want to reverse. The term text is stored as a char array, so it is best to work on it directly rather than construct a String object.
  • Lines 25 – 28 – create a new buffer and fill it with the reversed contents of the original one.
  • Line 29 – clear the original buffer (needed in case of using append methods).
  • Line 30 – copy the changes we made to the buffer of the token stream attribute.
  • Line 31 – return true in order to inform that there is a token available for further processing.
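
A possible variation, not the version used above: the extra buffer from lines 25 – 28 can be avoided by swapping characters in place. The sketch below shows the idea on a plain char array; the class name InPlaceReverseDemo and the method reverseInPlace are hypothetical. Inside the filter, I assume such a helper would be called with charTermAttr.buffer() and charTermAttr.length(), in which case the setEmpty and copyBuffer calls would no longer be needed, because the term length does not change.

```java
public class InPlaceReverseDemo {
  // Hypothetical helper: reverses the first `length` chars of `buffer` in place.
  // Characters beyond `length` are left untouched.
  static void reverseInPlace(char[] buffer, int length) {
    for (int i = 0, j = length - 1; i < j; i++, j--) {
      char tmp = buffer[i];
      buffer[i] = buffer[j];
      buffer[j] = tmp;
    }
  }

  public static void main(String[] args) {
    // A term buffer may be longer than the term itself,
    // so only the first 7 chars ("solr.pl") are reversed here.
    char[] buffer = {'s', 'o', 'l', 'r', '.', 'p', 'l', 'X', 'X'};
    reverseInPlace(buffer, 7);
    System.out.println(new String(buffer, 0, 7));
  }
}
```

Passing the length separately matters because the array returned by the attribute can be larger than the term it currently holds.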

Filter Factory

As I wrote earlier, in order for Solr to be able to use our filter, we need to implement a filter factory class. Because we don't have any special configuration values, the factory implementation is very simple. We will extend the BaseTokenFilterFactory class from the org.apache.solr.analysis package. The implementation can look like the following:

package pl.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ReverseFilterFactory extends BaseTokenFilterFactory {
  @Override
  public TokenStream create(TokenStream ts) {
    return new ReverseFilter(ts);
  }
}

As you can see, the filter factory implementation is simple: we only needed to override a single create method, in which we instantiate our filter and return it.

Configuration

After compiling the code and preparing the jar file, we copy the jar to a directory where Solr will be able to see it. We can do this by creating the lib directory in the Solr home directory and then adding the following entry to the solrconfig.xml file:

<lib dir="../lib/" regex=".*\.jar" />

Then we change the schema.xml file and add a new field type that will use our filter:

<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="pl.solr.analysis.ReverseFilterFactory" />
  </analyzer>
</fieldType>

It is worth noting that, as the class attribute value of the filter tag, we provide the full package and class name of the factory we created, not of the filter itself. It is important to remember this; otherwise Solr will throw errors.
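
To actually use the new type, a field in schema.xml has to reference it. A minimal sketch, assuming a field named title_reversed (the field name and the indexed/stored settings are only an illustration, not part of the setup described in this post):

```xml
<!-- Hypothetical field; the name and flags are examples only -->
<field name="title_reversed" type="text_reversed" indexed="true" stored="true" />
```

The type attribute must match the name given to the fieldType element, here text_reversed.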

Does It Work?

In order to show you that it works, I provide the following screen shot of the Solr administration panel:

To Sum Up

As you can see in the above example, creating your own filter is not complicated. Of course, the idea of the filter was very simple, and thus its implementation was simple too. I hope this post will be helpful when the time comes that you need to create your own filter for Solr.



Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)
