
My passion is building crawlers and search engines. In particular, I specialize in building vertical search engines like Indeed.com, Homethinking.com, Bright.com and Enormo.com (all companies I've worked with). I've also worked on products such as Atlassian Jira and Confluence to improve their search capabilities. Kelvin has posted 22 posts at DZone.

A Phrase-based, Out-of-order Solr Autocomplete Suggester

09.22.2013

Solr has a number of Autocomplete implementations that are great for most purposes. However, a client of mine recently had some fairly specific requirements for Autocomplete:

1. Phrase-based substring matching
2. Out-of-order matches ('foo bar' should match 'the bar is foo')
3. Fallback matching to a secondary field when substring matching on the primary field fails, e.g., 'windstopper jac' doesn't match anything on the 'title' field, but matches on the 'category' field

The most direct way to model this would probably have been to create a separate Solr core and use n-gram plus shingles indexing, along with Solr queries, to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.

The general strategy was:

1. For each entry in the primary field, create n-gram tokens and add them to a Guava Table where the row key is the n-gram, the column key is the original string and the value is a distance score.
2. For each entry in the secondary field, create n-gram tokens and add them to a Guava Multimap where the key is the n-gram and the value is the term.
3. When an Autocomplete query is received, split it on spaces, then look up each token in the primary Table.
4. If no matches are found, look up the tokens in the secondary Multimap.
5. Return results.
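
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the class name `PhraseSuggester`, its method names and the `MIN_GRAM` value are hypothetical, plain `java.util` maps stand in for the Guava Table and Multimap so the example runs without extra dependencies, and the score is a placeholder. It also assumes that n-grams are generated per word and that an entry must match every query token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PhraseSuggester {
    private static final int MIN_GRAM = 2; // assumed minimum n-gram length

    // Stand-in for a Guava Table<String, String, Double>: n-gram -> (entry -> score).
    private final Map<String, Map<String, Double>> primary = new HashMap<>();
    // Stand-in for a Guava Multimap<String, String>: n-gram -> terms.
    private final Map<String, Set<String>> secondary = new HashMap<>();

    /** All substrings of a word that have at least MIN_GRAM characters. */
    static Set<String> ngrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + MIN_GRAM; end <= word.length(); end++) {
                grams.add(word.substring(start, end));
            }
        }
        return grams;
    }

    /** Step 1: index a primary-field entry (e.g. a title). */
    void addPrimary(String entry) {
        String[] words = entry.toLowerCase().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            double score = 1.0 / (1 + i); // placeholder: earlier words score higher
            for (String gram : ngrams(words[i])) {
                primary.computeIfAbsent(gram, k -> new HashMap<>())
                       .merge(entry, score, Double::max);
            }
        }
    }

    /** Step 2: index a secondary-field term (e.g. a category). */
    void addSecondary(String term) {
        for (String word : term.toLowerCase().split("\\s+")) {
            for (String gram : ngrams(word)) {
                secondary.computeIfAbsent(gram, k -> new HashSet<>()).add(term);
            }
        }
    }

    /** Steps 3 to 5: look up every token, falling back to the secondary index. */
    List<String> suggest(String query) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");

        Map<String, Double> firstRow = primary.getOrDefault(tokens[0], Collections.emptyMap());
        Map<String, Double> hits = new HashMap<>(firstRow);
        for (int i = 1; i < tokens.length; i++) {
            Map<String, Double> row = primary.getOrDefault(tokens[i], Collections.emptyMap());
            hits.keySet().retainAll(row.keySet());              // keep entries matching all tokens
            hits.replaceAll((entry, score) -> score + row.get(entry));
        }
        if (!hits.isEmpty()) {
            List<String> ranked = new ArrayList<>(hits.keySet());
            ranked.sort((a, b) -> {
                int byScore = Double.compare(hits.get(b), hits.get(a)); // highest score first
                return byScore != 0 ? byScore : a.compareTo(b);
            });
            return ranked;
        }

        // No primary match: try the secondary index, returned in alphabetical order.
        Set<String> firstTerms = secondary.getOrDefault(tokens[0], Collections.emptySet());
        Set<String> terms = new TreeSet<>(firstTerms);
        for (int i = 1; i < tokens.length; i++) {
            terms.retainAll(secondary.getOrDefault(tokens[i], Collections.emptySet()));
        }
        return new ArrayList<>(terms);
    }
}
```

Because each query token is looked up independently and the results are intersected, word order in the query does not matter. Generating every substring of every word is only reasonable for a small index, which matches the in-memory trade-off described above.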

Scoring for the primary table was simple, based on the length of the word and the distance of the token from the start of the string.
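
The exact formula is not given here, so the following is only one way such a score could look. It assumes a higher score is better, and combines how much of the word the n-gram covers with how close the word is to the start of the string. The class name `SuggestionScore`, the method signature and the formula itself are all hypothetical.

```java
class SuggestionScore {
    /**
     * Hypothetical score in (0, 1]; higher is better (an assumption).
     *
     * @param wordOffset character offset of the word within the full string
     * @param wordLength length of the word the n-gram was taken from
     * @param gramLength length of the matching n-gram
     */
    static double score(int wordOffset, int wordLength, int gramLength) {
        if (wordOffset < 0 || wordLength <= 0 || gramLength <= 0 || gramLength > wordLength) {
            throw new IllegalArgumentException("invalid offset or lengths");
        }
        double coverage = (double) gramLength / wordLength; // share of the word that matched
        double proximity = 1.0 / (1 + wordOffset);          // 1.0 when the word starts the string
        return coverage * proximity;
    }

    public static void main(String[] args) {
        // In "the bar is foo", "bar" starts at offset 4 and "foo" at offset 11.
        System.out.println(score(4, 3, 3));  // full match on "bar"
        System.out.println(score(11, 3, 2)); // partial match "fo" on "foo"
    }
}
```

With this formula, a full-word match at the start of a string scores 1.0, and the score drops as the match gets shorter relative to its word or sits further into the string.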

Published at DZone with permission of its author, Kelvin Tan. (source)


Comments

Karol Duleba replied on Mon, 2013/09/23 - 5:04pm

A while ago I had a similar problem (an "intelligent" prompt). Additionally, part of the query could be in one document and part in another; we needed to balance the number of documents against relevance.

To solve it, I wrote a small custom plugin and used it with field collapsing.

The plugin itself needs some more work, but it's good enough for now: https://github.com/mrfuxi/TokenMatcher
