My passion is building crawlers and search engines. In particular, I specialize in building vertical search engines like Indeed.com, Homethinking.com, Bright.com and Enormo.com (all companies I've worked with). I've also worked on products such as Atlassian Jira and Confluence to improve their search capabilities. Kelvin has posted 22 posts at DZone. You can read more from them at their website. View Full User Profile

6-Step Lucene Multi-Point Spatial Search Algorithm

04.17.2012
| 5529 views |
  • submit to reddit

This post describes a method of augmenting the lucene-spatial contrib package to support multi-point searches. It is quite similar to the method described http://www.supermind.org/blog/548/multiple-latitudelongitude-pairs-for-a-single-solrlucene-doc with some minor modifications.

The problem is as follows:

A company (mapped as a Lucene doc) has an address associated with it. It also has a list of store locations, which each have an address. Given a lat/long point, return a list of companies which have either a store location or an address within x miles from that point. There should be the ability to search on just company addresses, store locations, or both.


This problem requires that you index a "primary" lat/long pair, and multiple "secondary" lat/long pairs, and be able to search only primary lat/long, only secondary lat/long or both.

This excludes the possibility of using SOLR-2155 or LUCENE-3795 as-is. I'm sure it would have been possible to patch either to do so

Also, SOLR-2155 depended on Solr, and I needed a pure Lucene 3.5 solution. And MultiValueSource, which SOLR-2155 uses, does not appear to be supported in Lucene 3.5.

The SOLR-2155 implementation is also pretty inefficient: it creates a List object
for every single doc in the index in order to support multi-point search.

The general outline of the method is:

1. Search store locations index and collect company IDs and distances
2. Augment DistanceFilter with store location distances
3. Add a BooleanQuery with company IDs. This is to include companies in the final result-set whose address does not match, but have one or more store locations which do
4. Search company index
5. Return results

The algorithm in detail:

1. Index the company address with the company document, i.e the document containing company fields such as name etc

2. In a separate index (or in the same index but in a different document "type"), index the store locations, adding the company ID as a field.

3. Given a lat/long point to search, first search the store locations index. Collect a unique list of company doc-ids:distance in a LinkedHashMap, checking for duplicates. Note that this is the lucene doc-id of the store location's corresponding company, NOT the company ID field value. This will be used to augment the distancefilter in the next stage.

Hint: you'll need to use TermDocs to get this, like so:

for(int i=0;i<locationHits.docs.totalHits;++i) {
      int locationDocId = locationHits.docs.scoreDocs[i].doc;
      int companyId = companyIds[locationDocId];
      double distance = locationHits.distanceFilter.getDistance(locationDocId);
      if(companyDistances.containsKey(companyId)) continue;
      Term t = new Term("id", Integer.toString(companyId));
      TermDocs td = companyReader.termDocs(t);
      if (td.next()) {
        int companyDocId = td.doc();
        companyDistances.put(companyDocId, distance);
      }
      td.close();
    }

Since the search returns results sorted by distance (using lucene-spatial's DistanceFilter), you're assured to have a list of company doc ids in ascending order of distance.

In this same pass, also collect a list of company IDs. This will be used to build the BooleanQuery used in the company search.

4. Set company DistanceFilter's distances. Note: in Lucene 3.5, I added a one-line patch to DistanceFilter so that setDistances() calls putAll() instead of replacing the map.

final DistanceQueryBuilder dq = new DistanceQueryBuilder(centerLat, centerLng, milesF, "lat", "lng", CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, 0, 20);
dq.getDistanceFilter().setDistances(companyDistances);
 

5. Build BooleanQuery including company IDs

    BooleanQuery bq = new BooleanQuery();
    for(Integer id: companyIds) bq.add(new TermQuery(new Term("id", Integer.toString(id))), BooleanClause.Occur.SHOULD);
    bq.add(distanceQuery, BooleanClause.Occur.SHOULD);

6. Search and return results

Published at DZone with permission of its author, Kelvin Tan. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)