Robin Bramley is a hands-on architect who has spent the last decade working with Java, mobile, and open source across sectors including financial services and high-growth start-ups. Prior to that he helped UK police forces with MIS/reporting and intelligence systems. He has contributed to a wide variety of open source projects, including adding OpenID support to Spring Security. Robin is a DZone MVB.

Relevancy Driven Development with Solr

10.29.2011

The relevancy of search engine results is subjective, and therefore so is testing the relevancy of queries.
One established technique in the information retrieval field is the use of judgement lists; an alternative approach, discussed here, is to follow the Behaviour Driven Development (BDD) methodology using user story acceptance criteria. I’ve been calling this Relevancy Driven Development, or RDD for short.
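To make the judgement-list idea concrete, here is a minimal sketch of how such a list can turn relevancy into a number a build can assert on. The product ids, the query, and the precision@k metric are illustrative assumptions, not taken from the article's project:

```java
import java.util.*;

public class JudgementListCheck {
    // Hypothetical judgement list: query -> ids a human judged relevant
    static final Map<String, Set<String>> JUDGEMENTS = Map.of(
        "exercise bike", Set.of("PRD-123", "PRD-234"));

    // Precision@k: the fraction of the top-k results that were judged relevant
    static double precisionAtK(String query, List<String> resultIds, int k) {
        Set<String> relevant = JUDGEMENTS.getOrDefault(query, Set.of());
        long hits = resultIds.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        // Pretend these ids came back from the search engine, in rank order
        List<String> results = List.of("PRD-123", "PRD-234", "PRD-999");
        System.out.println(precisionAtK("exercise bike", results, 2)); // 1.0
        System.out.println(precisionAtK("exercise bike", results, 3));
    }
}
```

A CI job can fail the build when precision@k drops below a threshold; the RDD approach described below replaces the numeric threshold with story-level assertions.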

I’d like to thank Eric Pugh for a great discussion on search engine testing and for giving me a guest slot in his ‘Better Search Engine Testing’ talk* at Lucene EuroCon Barcelona 2011 last week to mention RDD. The first iteration of Solr-RDD combines my passion for automated testing with my passion for Groovy by leveraging EasyB (a Groovy BDD testing framework).

Background – Testing Solr

I’d been applying some of the best practices from my Java/Grails projects to Solr, but initially this focused on the performance aspects: using the (production) ‘access’ request log from Solr, JMeter plus the Access Log Sampler, and of course Jenkins. To cope with the evolutionary nature of the schema and of the query (when not using (e)dismax), this was accompanied by some Groovy ‘migration’ scripts:

  • An index dumper script – to walk the Lucene index and export the documents to Solr update XML format
  • A data modifier script – to modify the XML dataset
  • An access log processing script – to update the queries that were replayed

plus delete_all.xml and optimize.xml for use with Solr’s post.sh script.
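For reference, these two helper files are tiny; a sketch of what they typically contain, using the standard Solr update XML message format (the exact files from the project aren’t shown in the article):

delete_all.xml – remove every document from the index:

<delete><query>*:*</query></delete>

optimize.xml – merge the index segments:

<optimize/>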

Whilst this gave us confidence that we could track the performance trends of any query changes or configuration tuning, it didn’t address relevancy. For that we had another script, known as the SolrJ Query Tool, which executed pre-canned queries. This didn’t have an automated feedback loop, however: the results were emailed to the client for them to assess (there wasn’t a judgement list, due to time constraints).

The importance of the controlled/constrained dataset

If you read between the lines above, we would recreate the data to a known state before each test run.
This is critical if you are to be able to make valid assertions about search results.

Solr-RDD

The aim is to use a story format to describe query relevancy, e.g.:

given our product data set
when I search for ‘exercise bike’
and I sort by price descending
then I should get two results with ids [PRD-123,PRD-234]
and PRD-123 has a higher score than PRD-234

Using SolrJ, this should be viewed as an integration test run directly against Solr, rather than as a functional test that uses an HTTP client to interact with the primary web application.

Iteration 1

This was essentially the alpha-grade implementation; for this run-through I used Solr 1.4.1, as that was the version used on the project that made the idea concrete.

Pre-requisites
  • Solr installed and running the example core (e.g. cd /Applications/apache-solr-1.4.1/example/; java -jar start.jar)
  • Download a copy of EasyB from http://www.easyb.org/download.html
  • You’ll also need Ivy if you want to use Groovy dependency management
Dependencies

I highly recommend mvnrepository.com for tracking down dependencies; it even lists them directly in Groovy @Grab form.


@Grab(group='org.apache.solr', module='solr-solrj', version='1.4.1')
@Grab(group='org.slf4j', module='slf4j-nop', version='1.6.2')
Imports

import org.apache.solr.client.solrj.*
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.client.solrj.response.*
import org.apache.solr.common.*
The before fixture

SolrServer server
 
before "configure search client", {
 url = 'http://localhost:8983/solr'
 server = new CommonsHttpSolrServer(url)
}
 
before "set up constrained data", {
 given "our sample product data set", {
  SolrInputDocument doc1 = new SolrInputDocument()
  doc1.addField("id", "PRD-123", 1.0f)
  doc1.addField("name", "Best exercise bike", 1.5f)
  doc1.addField("price", 100)
 
  SolrInputDocument doc2 = new SolrInputDocument()
  doc2.addField("id", "PRD-234", 1.0f)
  doc2.addField("name", "Old exercise bike", 1.0f)
  doc2.addField("price", 20)
 
  Collection docs = new ArrayList()
  docs.add(doc1)
  docs.add(doc2)
 
  server.add(docs)
  server.commit()
 }
}
The sample scenario

scenario "Exercise bikes",{
 SolrQuery query = new SolrQuery()
 def rdocs
 
 when "I search for 'exercise bike'", {
  query.setQuery("name:\"exercise bike\"")
  query.addField('score')
 }
 and "I sort by price descending", {
  query.addSortField("price", SolrQuery.ORDER.desc)
 }
 then "I should get two results with ids [PRD-123,PRD-234]", {
  QueryResponse rsp = server.query(query)
  rdocs = rsp.getResults()
 
  rdocs.size().shouldBe(2)
 
  rdocs[0].id.shouldBe('PRD-123')
  rdocs[1].id.shouldBe('PRD-234')
 }
 and "PRD-123 has a higher score than PRD-234", {
  rdocs[0].score.shouldBeGreaterThan(rdocs[1].score)
 }
}
Executing from the Command line

The code from the four sections above was saved as ‘BaseSearch.story’ – the suffix instructs EasyB that it is a Story (as opposed to a specification).

Typing the following within the EasyB installation directory gives the output as shown in Figure 1:
java -cp easyb-0.9.8.jar:lib/commons-cli-1.2.jar:lib/groovy-all-1.7.5.jar:$GRAILS_HOME/lib/ivy-2.2.0.jar org.easyb.BehaviorRunner ~/Projects/rbramley/solr-rdd/stories/BaseSearch.story -prettyprint

Figure 1: EasyB pretty-printed command line output

If we now change the SolrQuery.ORDER.desc to SolrQuery.ORDER.asc and re-run, we’ll see the failure output as shown in Figure 2.

Figure 2: EasyB failure output

Note that the -txtstory argument will make EasyB output the stories in a ‘business-readable’ form.

Challenges
  • The EasyB classloader prevents the use of an Embedded Solr with the default Solr configuration (e.g. org.apache.solr.common.SolrException: Error loading class 'solr.FastLRUCache')
The Solr-RDD Backlog

The first iteration uses vanilla EasyB, but code needs to be written to remove the need for much of the boilerplate:

  1. A DSL to abstract the SolrJ client library (including a query builder)
  2. Data loading integration
  3. Dependency management simplification
  4. EasyB plugin / syntax extension (building on the above items)
  5. Jenkins / Hudson integration

Feel free to suggest additional features or participate via the fledgling project on GitHub.

References

* You can get an older version of Eric’s talk from here.

 

From http://leanjavaengineering.wordpress.com/2011/10/24/solr-rdd/

Published at DZone with permission of Robin Bramley, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Robert Craft replied on Thu, 2012/01/26 - 5:35am

Two things of concern: First, a user would have to “know” the data extremely well to formulate queries in that sort of detail, and Second, it does not appear to leave any room for unexpected information that might also be useful to the user. Perhaps this is a technique that works well with it.

