PHP, Python and Java developer located at Hvaler, Norway. Main interests include digital mapping, search and scalability. Mats is a DZone MVB and is not an employee of DZone and has posted 28 posts at DZone.

Writing a Solr Analysis Filter Plugin

06.30.2011

As we’ve been working on getting better results out of the phonetic search we’re currently doing at derdubor, I started writing a plugin for Solr to return better search results when searching for Norwegian names. So far we’ve been using the standard phonetic filter from Solr 1.2, with the double metaphone encoder turning each regular token into a phonetic value. The trouble is that a double metaphone value is just four letters, which means that a search word such as ‘trafikkontroll’ ends up matching ‘Dyrvik’. The latter is a name, while the former is a regular search string that would be better served through an article view. TRAFIKKONTROLL resolves to TRFK in double metaphone, while DYRVIK resolves to DRVK. T and D are considered similar, as are V and F, and voilà, you’ve got yourself a match in the search result, but not a visual one (or a semantic one, as the words have very different meanings).
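
The collision can be illustrated without the real encoder. The sketch below is not Double Metaphone; it only takes the two codes quoted above and folds the letter pairs described as similar (D/T and V/F) before comparing them. The class and method names are made up for this illustration.

```java
// Illustration only: this is NOT the Double Metaphone algorithm.
// It folds the letter pairs the text describes as "similar"
// and compares the two codes quoted in the text.
final class PhoneticCollisionSketch {

    // Map D to T and V to F so that similar-sounding letters compare equal.
    static String foldSimilar(String code) {
        return code.replace('D', 'T').replace('V', 'F');
    }

    // Two codes "collide" when they are equal after folding.
    static boolean collides(String first, String second) {
        return foldSimilar(first).equals(foldSimilar(second));
    }

    public static void main(String[] args) {
        // Codes as given in the text for TRAFIKKONTROLL and DYRVIK.
        System.out.println(collides("TRFK", "DRVK"));
    }
}
```

With only four letters per code and several letters folded together, unrelated words have plenty of room to land on the same value.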

To solve this, I decided to write a custom filter plugin which we could tune to names that are in use in Norway. The reasoning behind the rules, and hopefully the complete filter function we’re applying, will have to wait for another post.

First you need a factory that’s able to produce filters when Solr asks for them:

NorwegianNameFilterFactory.java:

    package no.derdubor.solr.analysis;
     
    import java.util.Map;
     
    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.lucene.analysis.TokenStream;
     
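    public class NorwegianNameFilterFactory extends BaseTokenFilterFactory
    {
        // Attributes from the <filter> element in schema.xml, as handed to init().
        Map<String,String> args;
     
        public Map<String,String> getArgs()
        {
            return args;
        }
     
        // Called by Solr when the factory is set up.
        public void init(Map<String,String> args)
        {
            this.args = args;
        }
     
        // Called by Solr to wrap a token stream in our filter.
        public NorwegianNameFilter create(TokenStream input)
        {
            return new NorwegianNameFilter(input);
        }
    }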

To compile this example yourself, put the file in no/derdubor/solr/analysis/ (which matches no.derdubor.solr.analysis in the package statement) and run

javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/NorwegianNameFilterFactory.java

You’ll need apache-solr-core.jar and lucene-core.jar in your classpath for this to work. You’ll of course also need the filter itself (which is returned from the create method above):

    package no.derdubor.solr.analysis;
     
    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
     
    public class NorwegianNameFilter extends TokenFilter
    {
        public NorwegianNameFilter(TokenStream input)
        {
            super(input);
        }
     
        public Token next() throws IOException
        {
            return parseToken(this.input.next());
        }
     
        public Token next(Token result) throws IOException
        {
            // pass the reusable token on to the wrapped stream
            return parseToken(this.input.next(result));
        }
     
        protected Token parseToken(Token in)
        {
            // the wrapped stream returns null when there are no more tokens
            if (in == null)
            {
                return null;
            }
     
            /* do magic stuff with in.termBuffer() here (a char[] which can be manipulated) */
            /* set the changed length of the new term with in.setTermLength(); before returning it */
            return in;
        }
    }
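
As a rough idea of what the “magic stuff” on the term buffer can look like, here is a minimal, self-contained sketch that edits a char[] in place and returns the new length. The rule it applies (collapsing doubled letters) is a hypothetical stand-in, not the name logic we actually use, and the class and method names are invented for the example.

```java
// Sketch of in-place term manipulation on a char[] buffer.
// The rule (collapsing runs of identical letters) is a hypothetical example.
final class TermBufferSketch {

    // Collapses runs of identical characters within buffer[0..length)
    // and returns the length of the shortened term.
    static int collapseDoubles(char[] buffer, int length) {
        if (length == 0) {
            return 0;
        }
        int write = 1; // next position to write a kept character to
        for (int read = 1; read < length; read++) {
            if (buffer[read] != buffer[write - 1]) {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }

    public static void main(String[] args) {
        char[] term = "trafikkontroll".toCharArray();
        int newLength = collapseDoubles(term, term.length);
        System.out.println(new String(term, 0, newLength)); // prints "trafikontrol"
    }
}
```

Inside parseToken this could be wired up as in.setTermLength(collapseDoubles(in.termBuffer(), in.termLength())), assuming a Lucene version where Token offers termLength() alongside termBuffer() and setTermLength().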

You should now be able to compile both files:

javac -source 1.6 -target 1.6 no/derdubor/solr/analysis/*.java

After compiling the plugin, create a jar file which contains it. This will be the “distributable” version of your plugin, and should contain the .class files it is made of.

jar cvf derdubor-solr-norwegiannamefilter.jar no/derdubor/solr/analysis/*.class

Create a lib directory in your Solr home directory, which is where you keep your bin/ and conf/ directories (conf/ contains schema.xml, etc.). This is where your custom libraries will live, so copy the file you just created (derdubor-solr-norwegiannamefilter.jar in the example above) into lib/.

Restart Solr and check that everything still works as it should. If everything still seems normal, it’s time to enable your filter. In one of your analyzer chains in schema.xml, simply append a <filter> element to insert your own filter into the chain:

    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="no.derdubor.solr.analysis.NorwegianNameFilterFactory" />
    </analyzer>
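
Any extra attributes you put on the <filter> element are handed to the factory as the args map in init(), which is how a filter can be made tunable from schema.xml. The sketch below shows one way to read such a value with a fallback. It uses a plain Map so that it runs on its own, and the minLength and maxLength attributes are hypothetical; the filter in this post does not define them.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: reading an optional integer setting from the factory's args map.
// The attribute names used below are hypothetical examples.
final class FilterArgsSketch {

    // Returns the attribute as an int, or defaultValue when the attribute
    // is missing or is not a valid number.
    static int intArg(Map<String, String> args, String name, int defaultValue) {
        String raw = args.get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] argv) {
        // Stand-in for what init() would receive for <filter ... minLength="3" />.
        Map<String, String> args = new HashMap<String, String>();
        args.put("minLength", "3");

        System.out.println(intArg(args, "minLength", 1));  // attribute present
        System.out.println(intArg(args, "maxLength", 10)); // attribute missing
    }
}
```

In the factory, the value would typically be read once in init() and passed on to the filter's constructor in create().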

Restart Solr again, and if everything still works as it should, you’re all set! Time to index some new data and commit it! Remember that you’ll need to reindex your existing data for things to work as you expect, since data that is already in the index is not reprocessed when you edit your configuration files.

Do a few searches through the admin interface to see that everything works as it should. I’ve used the “debug” option to .. well, debug .. my plugin while developing it. A very neat trick is to see what terms your filter expands to, which is shown in the first debug section of the result (you’ll have to scroll down to the end to see it). If you set type="query" in the analyzer section, the chain will be applied to all queries against that field.

If you need to debug things to a greater extent, you can attach a debugger or simply use the Good Old Proven Way of println! (The output will end up in catalina.out in logs/ in your Tomcat directory.) Good luck!

Potential Problems and How To Solve Them

  • If you get an error about incompatible class versions, check that the JVM on your Solr search server (java -version) is the same version as, or newer than, the one you compile for on your own development machine. Use -source 1.5 -target 1.5 when compiling to force 1.5 compatible class files instead of 1.6.
  • If you get an error about missing config or something similar, or that Solr is unable to find the method it’s searching for (generally triggered by a ReflectionException), remember to define your classes public! public class NorwegianNameFilter is your friend! It took me at least half an hour to realize what this simple issue was…
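
For the class version problem, it can help to check which version a compiled file actually targets. Every .class file starts with the magic number 0xCAFEBABE, followed by a two-byte minor and a two-byte major version, where major 49 means Java 5 and 50 means Java 6. The small tool below is a sketch with an invented name that reads those first eight bytes.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Sketch: print the class file major version of a compiled .class file.
final class ClassVersionSketch {

    // Expects at least the first 8 bytes of a class file:
    // 4 bytes magic (0xCAFEBABE), 2 bytes minor version, 2 bytes major version.
    static int majorVersion(byte[] header) {
        boolean hasMagic = header.length >= 8
                && (header[0] & 0xFF) == 0xCA && (header[1] & 0xFF) == 0xFE
                && (header[2] & 0xFF) == 0xBA && (header[3] & 0xFF) == 0xBE;
        if (!hasMagic) {
            throw new IllegalArgumentException("not a class file header");
        }
        // The major version is stored big-endian in bytes 6 and 7.
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header);
        } finally {
            in.close();
        }
        System.out.println(majorVersion(header)); // 49 = Java 5, 50 = Java 6
    }
}
```

Run it as java ClassVersionSketch no/derdubor/solr/analysis/NorwegianNameFilter.class and compare the number with what the JVM on the Solr server supports.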

Any comments and followups are of course welcome!

Published at DZone with permission of Mats Lindh, author and DZone MVB.


Comments

Sirikant Noori replied on Sun, 2012/01/15 - 12:03pm

Thanks for your hints.

It took me 2h to find out that I forgot to define my class PUBLIC :)

Carla Brian replied on Fri, 2012/06/08 - 10:16pm

I am new to this. I am not familiar with it yet. I need more resources about this one. - Markus Lattner
