PhD Student at University of Bonn and Research Assistant at Fraunhofer IAIS. Dino has posted 1 posts at DZone. You can read more from them at their website. View Full User Profile

Apache Solr Test Suite - Getting Optimal Response Times

04.11.2012
| 5857 views |
  • submit to reddit

In this article we will try to present the importance of understanding the data in pursuit of optimal response times for a search engine. By comprehending the data and the user you are able to tune a good generic search tool to meet your needs and demands. At the beginning article will present some facts (and numbers) trying to demonstrate the effect of term frequency and number of frequent terms in a single query on search performance. Tests were executed against an index of up to 7 million documents. Test suite contained several tests that were run in stages at which we were executing queries containing up to 7 terms.

Test suite results – facts and numbers

The first test demonstrates the relation of highly frequent index terms to search performance. Response times are given in milliseconds and the queries are sorted from fastest to slowest. Highly frequent terms have frequencies in range 50 000 – 150 000.

Comparing average response times for queries from this test the following graph is derived (average response times are given in milliseconds).

Now if one of few terms with a frequency greater than 150 000 is added to query we see clear performance degradation.

Majority of performance degradation was caused by term with frequency greater than 150 000 (this could be the term that appears in most of the documents).

The second test demonstrates relation of terms with medium frequency to search performance (response times are given in milliseconds and queries are sorted from slowest to fastest). Medium frequent terms have frequency values in range 10 000 – 50 000.

Again, after comparing the average times for each query type we obtain the following graph (average response times are given in milliseconds).

The third test demonstrates relation of terms with low frequency to search performance (response times are given in milliseconds and queries are sorted from slowest to fastest). Terms with low frequencies have frequency values in range 100 – 500 (elected terms).

Comparing average response times for each query type the following graph for low frequency term queries is derived (response times are given in milliseconds).

Finally, comparing queries with equal number of terms and different frequency class following graphs are derived.

Why are some queries significantly slower?

The slowest queries are the ones with commonly occurring terms (terms with the highest frequency). Queries with common terms (highly frequent) take longer because the data structures containing the lists of documents containing those terms in the index are larger. While processing user query search engine has to look at the list of document ids for each term in the query and determine which documents contain all the terms. The longer the lists, the more data must be read from disk into memory and more processing must be done to compare the lists[1].

Phrase queries take even longer than Boolean queries because when search engine processes phrase queries it needs to insure not only that all the words in the query are in a document, but that the words occur in the document adjacent to each other and in the same order as the query. In order to do this search engine needs to consult the positions index that indexes the position at which each word occurs in each document[2].

The size of document ids index and position index determines the amount of data that needs to be read from the cache or the disk to execute the query. In case of indices with really long documents phrase queries could require reading gigabytes of data due to size of position index and Boolean queries would settle with only few megabytes of data. This has implications on caching and performance of search engine.

Because reading data from disk is slower than getting the data from memory, search engines implement caching schemes to keep frequently used data in memory. Solr relies on operating system’s disk cache to cache data from ids list and position list. This is why two of the more frequently suggested performance tuning tips on Solr mailing list are:

  • Make sure that the server has enough memory to fit most of the index into memory,
  • Run a large number of „realistic“queries against the server to warm the cache before doing performance testing.


If our index size exceeds the RAM capacity of our machines once the cache is full each further query will cause data to be evicted from cache. Hence, phrase query which is not executed very often and requires more data to complete would cause eviction from cache of most frequently used data, forcing other queries to read from disk instead of cache and resulting in performance degradation. This is called cache pollution.

The good news is that most of our queries are less than 500 ms which is acceptable from user perspective. The potential problem could be the queries that take 500+ ms to complete. In our experience so far, response time scales linearly with index size, so when we scale up the index to 10+ million volumes, the queries that took more than 500 ms could take much longer. Also, these response times were obtained by executing queries one at a time. However, in production we may get new queries faster than SOLR can process them. Those additional queries will be contending with the „slow“ queries for resources such as disk access, cpu and memory and that could be the cause of one additional performance overhead.

Hence, when composing the queries search engine should be aware of highly frequent terms and able to comprehend the user intention (demand). In this way an optimal query could be derived from performance perspective that provides satisfactory response to user demand.


[1] See the appendix on following URLs: http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html; http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html

[2] See the appendix on following URL:  http://nlp.stanford.edu/IR-book/html/htmledition/positional-postings-and-phrase-queries-1.html

Code snippet used to extract terms in a given frequency range

public Map<Term, Integer> extractTermsByThreshold(IndexReader reader, String fieldName, int minFrequency, int maxFreqeuncy, int maxNumberOfTerms) throws CorruptIndexException, IOException{
 
Map<Term, Integer> collectedTerms = new HashMap<Term, Integer>();
 
    TermEnum terms = reader.terms();
 
    while(terms.next()){
String field = terms.term().field();
        if (!field.equals(fieldName)) {
            continue;
        }
 
        if(terms.docFreq() > minFrequency && terms.docFreq() < maxFreqeuncy){
            collectedTerms.put(terms.term(), terms.docFreq());
        }
 
        if(collectedTerms.size() > maxNumberOfTerms){
            break;
        }
    }
 
    return collectedTerms;
}



Published at DZone with permission of its author, Dino Oglic. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)