Transparent Indexing by Hibernate Search

10.31.2012

Curator's Note: This article was written by Andrew Ball, a developer for OSI (Open Software Integrators).

Consider the now-infamous Granny’s Addressbook application, the one that my company uses to teach and vet new technologies.  It has a simple UI with only a few fields.  You can find our SpringMVC/RDBMS version here and our NodeJS/MongoDB version with AJAX and Appcelerator example front end here.  The application might have a search screen with a text field and an exact-match option for each of name, address, email, and phone.


So simple that even Granny could use it, right?  But what about the code behind it if you’re using an RDBMS?  One “dumb” way to write this is as follows:

public List<Address> searchAddresses(String name, boolean nameExact,
        String address, boolean addressExact,
        String email, boolean emailExact,
        String phone, boolean phoneExact) {

    boolean addedOneCondition = false;
    StringBuilder sb = new StringBuilder();
    sb.append("SELECT a FROM Address a");

    // only emit a WHERE clause if at least one criterion was supplied
    if (name != null || address != null || email != null ||
            phone != null) {
        sb.append(" WHERE ");
    }

    if (name != null) {
        addedOneCondition = true;
        if (nameExact) {
            sb.append(" a.name = :name");
        } else {
            sb.append(" LOWER(a.name) LIKE :name");
        }
    }

    if (address != null) {
        if (addedOneCondition) {
            sb.append(" OR ");
        } else {
            addedOneCondition = true;
        }

        if (addressExact) {
            sb.append(" a.address = :address ");
        } else {
            sb.append(" LOWER(a.address) LIKE :address");
        }
    }

    if (email != null) {
        if (addedOneCondition) {
            sb.append(" OR ");
        } else {
            addedOneCondition = true;
        }

        if (emailExact) {
            sb.append(" a.email = :email");
        } else {
            sb.append(" LOWER(a.email) LIKE :email");
        }
    }

    if (phone != null) {
        if (addedOneCondition) {
            sb.append(" OR ");
        } else {
            addedOneCondition = true;
        }

        if (phoneExact) {
            sb.append(" a.phone = :phone");
        } else {
            sb.append(" LOWER(a.phone) LIKE :phone");
        }
    }

    // javax.persistence.TypedQuery keeps the result list typed as Address
    TypedQuery<Address> q = em.createQuery(sb.toString(), Address.class);

    // The inexact branches compare against LOWER(column), so the pattern
    // must be lowercased as well or mixed-case input never matches.
    if (name != null) {
        if (nameExact) {
            q.setParameter("name", name);
        } else {
            q.setParameter("name", "%" + name.toLowerCase() + "%");
        }
    }
    if (address != null) {
        if (addressExact) {
            q.setParameter("address", address);
        } else {
            q.setParameter("address", "%" + address.toLowerCase() + "%");
        }
    }
    if (email != null) {
        if (emailExact) {
            q.setParameter("email", email);
        } else {
            q.setParameter("email", "%" + email.toLowerCase() + "%");
        }
    }
    if (phone != null) {
        if (phoneExact) {
            q.setParameter("phone", phone);
        } else {
            q.setParameter("phone", "%" + phone.toLowerCase() + "%");
        }
    }

    return q.getResultList();
}
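
As an aside, the repetitive string assembly in that method can be tightened without changing the query it produces. The following is a minimal sketch, not the project's code: the class name AddressQueryBuilder and the helper addCondition are made up for illustration, and only the JPQL text is built, so it runs without a database. Field names are fixed in code, never taken from user input.

```java
import java.util.ArrayList;
import java.util.List;

class AddressQueryBuilder {

    /** Adds one predicate for a field, but only if the user supplied a value. */
    static void addCondition(List<String> conditions, String field,
                             String value, boolean exact) {
        if (value == null) {
            return;
        }
        conditions.add(exact
                ? "a." + field + " = :" + field
                : "LOWER(a." + field + ") LIKE :" + field);
    }

    /** Builds the JPQL text; parameters are still bound separately by name. */
    static String buildJpql(String name, boolean nameExact,
                            String address, boolean addressExact,
                            String email, boolean emailExact,
                            String phone, boolean phoneExact) {
        List<String> conditions = new ArrayList<>();
        addCondition(conditions, "name", name, nameExact);
        addCondition(conditions, "address", address, addressExact);
        addCondition(conditions, "email", email, emailExact);
        addCondition(conditions, "phone", phone, phoneExact);

        String jpql = "SELECT a FROM Address a";
        if (!conditions.isEmpty()) {
            jpql += " WHERE " + String.join(" OR ", conditions);
        }
        return jpql;
    }

    public static void main(String[] args) {
        // name searched inexactly, email exactly, other fields left empty
        System.out.println(buildJpql("sue", false, null, false,
                "sue@example.com", true, null, false));
    }
}
```

Tidying the Java does nothing for the database, though: the generated query is the same, and so is its cost.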



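This performs terribly on a table of any real size, even if you add an index for every column and combination of columns on which you could possibly search (an approach that will likely make any reasonably competent DBA upset, as it would destroy write performance). The reason is the leading wildcard: a pattern such as '%sue%' can match anywhere in the value, so an ordinary B-tree index cannot be used to narrow the search.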

A worst-case SQL query generated from the above code would resemble the following:

SELECT * FROM ADDRESS WHERE
LOWER("name") LIKE '%sue%' OR
LOWER("address") LIKE '%Morgan St.%' OR
LOWER("phone") LIKE '%555.555.5555%' OR
LOWER("email") LIKE '%sue.snodgrass@gmail.com%';



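Note that there are no indexes for anything but the “id” column. With PostgreSQL, an “EXPLAIN ANALYZE VERBOSE” on such a query shows a sequential scan: every single row of the table is read and checked against each pattern. (The ~~* operator in the filter is PostgreSQL's internal name for ILIKE, so the captured run used the case-insensitive ILIKE form of the query rather than LOWER(...) LIKE.) The test table here is tiny, so the runtime is still a fraction of a millisecond, but the work grows with every row added: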

Seq Scan on public.address  (cost=0.00..10.60 rows=1 width=2072) (actual time=0.049..0.052 rows=1 loops=1)
  Output: id, address, email, name, phone
  Filter: (((address.name)::text ~~* '%sue%'::text) OR ((address.address)::text ~~* '%Morgan St.%'::text) OR ((address.phone)::text ~~* '%555.555.5555%'::text) OR ((address.email)::text ~~* '%sue.snodgrass@gmail.com%'::text))
 Total runtime: 0.099 ms



But that doesn’t even begin to scratch the surface for issues like variations in phone number formats, nicknames (Did I enter “Sue” or “Susan”?), etc. Why can’t I just let Google search the data for me? Well, with Hibernate Search you can achieve something quite similar, with all open source tools and minimal effort. Hibernate Search is based on the much-acclaimed Apache Lucene project, which is very adept at indexing data for full-text searches, including automatically breaking words apart into root words and their inflections (“stemming”) and allowing for synonym lists.
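
To make the contrast with the sequential scan concrete, here is a toy inverted index in plain Java. It is only a sketch of the general idea behind full-text indexes, not Lucene's actual data structures; the class name ToyInvertedIndex and the sample rows are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyInvertedIndex {

    // token -> ids of the rows whose text contains that token
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the text into lowercase tokens and records the row id under each. */
    void add(int rowId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(rowId);
            }
        }
    }

    /** One map lookup per search term, no matter how many rows were indexed. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Sue Snodgrass");
        index.add(2, "123 Morgan St.");
        index.add(3, "Susan Morgan");
        System.out.println(index.search("morgan")); // [2, 3]
        System.out.println(index.search("SUE"));    // [1]
    }
}
```

The work of splitting and normalizing text is paid once, when a row is written, and a search becomes a lookup by token instead of a pattern test against every row.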

So, how do we go about getting Hibernate Search to enable full-text search for our example entity? The first step is to add the necessary JBoss repositories to our Maven pom.xml file if they aren’t there already:

<repositories>
  <!-- ... -->
  <repository>
    <id>jboss-public-repository-group</id>
    <name>JBoss Public Maven Repository Group</name>
    <url>https://repository.jboss.org/nexus/content/groups/public-jboss/</url>
    <layout>default</layout>
    <releases>
      <enabled>true</enabled>
      <updatePolicy>never</updatePolicy>
    </releases>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>never</updatePolicy>
    </snapshots>
  </repository>
</repositories>
<pluginRepositories>
  <!-- ... -->
  <pluginRepository>
    <id>jboss-public-repository-group</id>
    <name>JBoss Public Maven Repository Group</name>
    <url>https://repository.jboss.org/nexus/content/groups/public-jboss/</url>
    <layout>default</layout>
    <releases>
      <enabled>true</enabled>
      <updatePolicy>never</updatePolicy>
    </releases>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>never</updatePolicy>
    </snapshots>
  </pluginRepository>
</pluginRepositories>



Then we can take our JPA-annotated entity and add a few Hibernate Search annotations (@Indexed, @AnalyzerDef, and @Field):

@Entity
@NamedQueries({
    @NamedQuery(name="Address.findAll",
        query="select a from Address a"),
    @NamedQuery(name="Address.findByName",
        query="select a from Address a where a.name = ?1")})
@Indexed
@AnalyzerDef(name = "customanalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = {
            @Parameter(name = "language", value = "English")
        })
    })
// @AnalyzerDef only declares the analyzer; @Analyzer applies it to the
// indexed fields of this entity
@Analyzer(definition = "customanalyzer")
public class Address {
    @Id
    @GeneratedValue(strategy=GenerationType.AUTO)
    private Long id;

    @Field(index=Index.TOKENIZED, store=Store.NO)
    private String name;
    @Field(index=Index.TOKENIZED, store=Store.NO)
    private String email;
    @Field(index=Index.TOKENIZED, store=Store.NO)
    private String phone;
    @Field(index=Index.TOKENIZED, store=Store.NO)
    private String address;

    /* . . . */
}



Most of these annotations are fairly straightforward to understand. @Indexed indicates that we want Hibernate Search to manage indexes for this entity. @Field indicates that a particular property is to be indexed. For each indexed field we can specify whether it should be tokenized (that is, split into parts, usually words) when indexed or treated as a single token. Because the fields here are tokenized, “Abe” will match “Abe Lincoln” without having to specify that we want to allow extra characters with a search pattern such as “Abe*”.

The more interesting annotations have to do with some of the more interesting functionality that Lucene (and, by extension, Hibernate Search) provides. The @AnalyzerDef gives some extra directions on the kind of processing that we want to happen to the content before indexing takes place. For example, the LowerCaseFilterFactory token filter converts all text to lowercase before indexing occurs. The Snowball-Porter filter factory stems tokens before they are indexed: inflected forms are reduced to a common root (“stem”), so “hiking”, “hiked”, and “hikes” would all be indexed as “hike”.
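
The analysis chain described above (tokenize, lowercase, stem) can be imitated in a few lines of plain Java. This is a deliberately crude sketch: the class ToyAnalyzer and its suffix list are invented for illustration, and the suffix stripping is not the Snowball algorithm, so the stems it produces (“hik” rather than “hike”) differ from what Lucene would store. It only shows why different surface forms end up under the same index term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ToyAnalyzer {

    // Checked in order; a real stemmer uses far more careful rules.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "e", "s"};

    /** Strips the first matching suffix, keeping at least three characters. */
    static String crudeStem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    /** Tokenizes on non-alphanumeric characters, lowercases, then stems. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(crudeStem(raw.toLowerCase(Locale.ROOT)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hiking hiked HIKES hike")); // [hik, hik, hik, hik]
        System.out.println(analyze("Abe Lincoln"));             // [abe, lincoln]
    }
}
```

The same chain has to run on the search string as well, otherwise a query for “Hiking” would look for a term that was never stored in that form.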

After adding a few properties to the JPA META-INF/persistence.xml to tell Hibernate Search where to store the Lucene indexes, we can write a totally different search method as follows:

@SuppressWarnings("unchecked") // the JPA wrapper returns an untyped List
public List<Address> fullTextSearch(String stringToMatch) {
    FullTextEntityManager ftem = org.hibernate.search.jpa.Search.getFullTextEntityManager(em);

    // build up a Lucene query that matches the keyword in any of the four fields
    QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder()
            .forEntity(Address.class).get();
    org.apache.lucene.search.Query luceneQuery = qb
            .keyword()
            .onFields("name", "address", "phone", "email")
            .matching(stringToMatch.toLowerCase())
            .createQuery();

    // wrap the Lucene query in a JPA query
    Query jpaWrappedQuery = ftem.createFullTextQuery(luceneQuery,
            Address.class);

    return jpaWrappedQuery.getResultList();
}
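
The persistence.xml properties mentioned before the method are not shown in the article. As a hedged sketch, these are the two settings the Hibernate Search documentation of that era describes for a filesystem-backed index, collected here in a plain Java map. The class name SearchIndexSettings and the index path are made up, and the short provider name “filesystem” is an assumption about the version in use; older releases expect the full provider class name instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SearchIndexSettings {

    /** Returns the index-location settings for a given base directory. */
    static Map<String, String> searchProperties(String indexBaseDir) {
        Map<String, String> props = new LinkedHashMap<>();
        // store the Lucene index on disk (assumed short name, see note above)
        props.put("hibernate.search.default.directory_provider", "filesystem");
        // directory under which one index per entity is created
        props.put("hibernate.search.default.indexBase", indexBaseDir);
        return props;
    }

    public static void main(String[] args) {
        // hypothetical path; choose a directory the application can write to
        searchProperties("/var/lucene/indexes")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The same keys and values can be written as property elements inside the persistence unit in META-INF/persistence.xml, or the map can be passed as the second argument to Persistence.createEntityManagerFactory.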



All of this indexing is done transparently by Hibernate Search as entities are persisted, updated, and removed. A keyword lookup goes straight from a term to the entities that contain it in the Lucene index, so it avoids the row-by-row table scan that the LIKE query forces on the relational database. On top of that, the rest of Apache Lucene's feature set is available. What’s not to like? The Hibernate Search project has very good documentation (indeed, much of this implementation comes from that documentation). A simple implementation is in Andrew Ball’s copy of the SpringGrannyMVC project on GitHub at https://github.com/cortextual/OSIL (please use the “search” branch). Happy searching!


Published at DZone with permission of its author, Andrew C. Oliver. (source)
