Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 38 posts at DZone. You can read more from them at their website. View Full User Profile

Catching slowdowns in Lucene

05.05.2011
| 5041 views |
  • submit to reddit
Lucene has great randomized tests to catch functional failures, but when we accidentally commit a performance regression (we slow down indexing or searching), nothing catches us!

This is scary, because we want things to get only faster with time.

So, when there's a core change that we think may impact performance, we run before/after tests to verify. But this is ad-hoc and error-proned: we could easily forget to do this, or fail to anticipate that a code change might have a performance impact.

Even when we do test performance of a change, the slowdown could be relatively small, easily hiding within the unfortunately often substantial noise of our tests. Over time we might accumulate many such small, unmeasurable slowdowns, suffering the fate of the boiling frog. We do also run performance tests before releasing, but it's better to catch them sooner: solving slowdowns just before releasing is.... dangerous.

To address this problem, I've created a script that runs standard benchmarks on Lucene's trunk (to be 4.0), nightly. It indexes all of Wikipedia's English XML export, three times (with different settings and document sizes), runs a near-real-time (NRT) turnaround time test for 30 minutes, and finally a diverse set of hard queries.

This has been running for a few weeks now, and the results are accessible to anyone.

It's wonderful to see that Lucene's indexing throughput is already a bit faster (~98 GB plain text per hour) than when I last measured!

Near-real-time reopen latency is here; the test measures how long it takes (on average, after discarding outliers) to open a new NRT reader. It's quite intensive, indexing around 1 MB plain text per second as updates (delete+addDocument), and reopening once per second, on the full previously built Wikipedia index.

To put this in perspective, that's almost twice Twitter's recent peak indexing rate during the 2011 Superbowl (4,064 Tweets/second), although Twitter's use-case is harder because the documents are much smaller, and presumably there's additional indexed metadata beyond just the text of the Tweet. Twitter has actually implemented some cool changes to Lucene to enable real-time searching without reopening readers; Michael Busch describes them here and here. Some day I hope these will be folded into Lucene!

Finally, we test all sorts of queries: PhraseQuery (exact and sloppy), FuzzyQuery (edit distance 1 and 2), four variants of BooleanQuery, NumericRangeQuery, PrefixQuery, WildcardQuery, SpanNearQuery, and of course TermQuery. In addition we test the automaton spell checker, and primary-key lookup.

A few days ago, I switched all tests to the very fast 240 GB OCZ Vertex 3 (previously it was a traditional spinning-magnets hard drive). It looks like indexing throughput gained a bit of performance (~102 GB plain text per hour), the search performance was unaffected (expected, because for this test all postings easily fit in available RAM), but the NRT turnaround time saw a drastic reduction in the noise to near-zero. NRT is very IO intensive so it makes sense having a fast IO system improves its turnaround time; I need to dig further into this.

Unfortunately, performance results are inherently noisy. For example you can see the large noise (the error band is +/- one standard deviation) in the TermQuery results; other queries seem to have less noise for some reason.

So far the graphs are rather boring: nice and flat. This is a good thing!
References
Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Chung Fang replied on Fri, 2011/08/12 - 4:35am

Hello

When you think about Cert Sites, what do you think of first? Which aspects of Cert Sites are important, which are essential, and which ones can you take or leave? You be the judge.

 

Knowledge can give you a real advantage. To make sure you're fully informed about Cert Sites, keep reading.

Superb information that you have shared. We would get advantage from this great information. We always like cheap offers whether they are for study, registration of - and other matters. I have much earn from it and spend it on my pass4sure 642-982 and the most expensive pass4sure HP2-Z17. But i am sure this certification and exams clearance would give me return pass4sure JN0-332.

If you've picked some pointers about Cert Sites that you can put into action, then by all means, do so. pass4sure 642-874 You won't really be able to gain any benefits from your new knowledge if you don't use it.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.