My passion is building crawlers and search engines. In particular, I specialize in building vertical search engines like Indeed.com, Homethinking.com, Bright.com and Enormo.com (all companies I've worked with). I've also worked on products such as Atlassian Jira and Confluence to improve their search capabilities.

Using Guava's Multimap to Improve Solr's Autocomplete Suggester

03.16.2012

Context-less, multi-term autocomplete is difficult.

Given the term "di", we can look at our index and rank terms starting with "di" by frequency and return the n most frequent terms. Solr's TSTLookup and FSTLookup do this very well.
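To make the single-term case concrete, here is a minimal sketch in plain Java. It is not how TSTLookup or FSTLookup are implemented (those use specialised tree and automaton structures); it only illustrates the "filter by prefix, rank by frequency, keep the top n" idea. The class name PrefixRanker, the method topTerms and the frequencies are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrefixRanker {
    /** Returns up to n terms starting with prefix, most frequent first (ties broken alphabetically). */
    static List<String> topTerms(Map<String, Integer> termFreqs, String prefix, int n) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                matches.add(e);
            }
        }
        matches.sort((x, y) -> {
            int byFreq = Integer.compare(y.getValue(), x.getValue()); // descending frequency
            return byFreq != 0 ? byFreq : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < matches.size() && i < n; i++) {
            top.add(matches.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical term frequencies, not taken from a real index.
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("disney", 120);
        freqs.put("digital", 80);
        freqs.put("discovery", 80);
        freqs.put("diners", 45);
        freqs.put("walt", 30);
        System.out.println(topTerms(freqs, "di", 3)); // prints [disney, digital, discovery]
    }
}
```

A linear scan like this is only reasonable for small term lists; the point of the Solr lookups is to avoid it.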

However, given the term "walt di", we can no longer complete each term independently, as we did above, without looking silly, especially if the corpus in question is a list of US companies (hint: think Mickey Mouse). There's little excuse for suggesting "walt discovery" or "walt diners" when our corpus does not contain any documents with that combination of terms.

In the absence of a large number of historical user queries to augment the autocomplete, context is king when it comes to multi-term queries.

The simplest way I can think of doing this, if it is feasible to do so memory-wise, is to store each term together with the terms that immediately follow it in the corpus. For example, given the field value "international business machines", mappings would be created for

international=>business
business=>machines

Out-of-order queries wouldn't be supported with this system, nor would term skips (e.g. international machines).
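The limitations are easy to see in a small, Lucene-free sketch. This version assumes whitespace tokenisation and lowercasing as a stand-in for a real analyzer, and uses a plain Map of Sets in place of a multimap; NextTermMap, add and follows are hypothetical names chosen for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NextTermMap {
    // term => set of terms seen immediately after it
    private final Map<String, Set<String>> next = new HashMap<>();

    /** Records each adjacent pair of terms in the field value, in order. */
    void add(String fieldValue) {
        String prev = null;
        for (String term : fieldValue.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (prev != null) {
                next.computeIfAbsent(prev, k -> new HashSet<>()).add(term);
            }
            prev = term;
        }
    }

    /** True only if second was seen directly after first. */
    boolean follows(String first, String second) {
        return next.getOrDefault(first, Collections.<String>emptySet()).contains(second);
    }

    public static void main(String[] args) {
        NextTermMap m = new NextTermMap();
        m.add("international business machines");
        System.out.println(m.follows("international", "business")); // prints true
        System.out.println(m.follows("business", "international")); // prints false (out of order)
        System.out.println(m.follows("international", "machines")); // prints false (term skip)
    }
}
```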

Here's a method fragment that builds these mappings from a Lucene index:

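import java.io.StringReader;
import com.google.common.collect.HashMultimap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Fieldable;

// Assumes Lucene 3.x. reader is an IndexReader, a is an Analyzer, field is the field name.
HashMultimap<String, String> map = HashMultimap.create();
// Iterate to maxDoc() and skip deletions; numDocs() counts only live documents.
for (int i = 0; i < reader.maxDoc(); ++i) {
  if (reader.isDeleted(i)) continue;
  Fieldable fieldable = reader.document(i).getFieldable(field);
  if (fieldable == null) continue;
  String fieldVal = fieldable.stringValue();
  if (fieldVal == null) continue;
  TokenStream ts = a.tokenStream(field, new StringReader(fieldVal));
  CharTermAttribute attr = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  String prev = null;
  while (ts.incrementToken()) {
    // intern() so repeated terms share a single String instance
    String v = attr.toString().intern();
    if (prev != null) {
      map.put(prev, v); // term => term that immediately follows it
    }
    prev = v;
  }
  ts.end();
  ts.close();
}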

Guava's Multimap is perfect for this, and Solr already has a Guava dependency, so we might as well make full use of it.
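Once the mappings exist, answering a query like "walt di" is a matter of looking up the second-to-last term and filtering its followers by the last, partial term. The sketch below is one way to do that, not the only one. ContextSuggester and suggest are hypothetical names, and the method takes a plain Map of collections so that it runs without Guava; with a Guava Multimap, passing map.asMap() should fit the same signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ContextSuggester {
    /**
     * Completes the last term of a multi-term query using the term before it as context.
     * Returns an empty list for single-term queries, which need a plain prefix lookup instead.
     */
    static List<String> suggest(Map<String, ? extends Collection<String>> next, String query) {
        String[] terms = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (terms.length < 2) {
            return Collections.emptyList();
        }
        String context = terms[terms.length - 2];
        String prefix = terms[terms.length - 1];
        Collection<String> followers = next.get(context);
        if (followers == null) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        for (String follower : followers) {
            if (follower.startsWith(prefix)) {
                out.add(follower);
            }
        }
        Collections.sort(out); // alphabetical; a real suggester would likely rank by frequency
        return out;
    }

    public static void main(String[] args) {
        // Hand-built mappings for illustration only.
        Map<String, List<String>> next = new HashMap<>();
        next.put("walt", Arrays.asList("disney"));
        next.put("discovery", Arrays.asList("communications"));
        System.out.println(suggest(next, "walt di")); // prints [disney]
    }
}
```

Because only followers of "walt" are considered, "walt discovery" and "walt diners" can never be suggested unless those pairs actually occur in the corpus.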

Published at DZone with permission of its author, Kelvin Tan. (source)
