Solr-Lucene Zone is brought to you in partnership with:

Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member, for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things. Michael is a DZone MVB and is not an employee of DZone and has posted 49 posts at DZone. You can read more from them at their website. View Full User Profile

265% indexing speedup with Lucene's concurrent flushing

  • submit to reddit
A week ago, I described the nightly benchmarks we use to catch any unexpected slowdowns in Lucene's performance. Back then the graphs were rather boring (a good thing), but, not anymore! Have a look at the stunning jumps in Lucene's indexing rate:

(Click through the image to see details about what changed on dates A, B, C and D).

Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's a 265% jump! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds.

How did this happen? Concurrent flushing.

That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.

Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute intensive flush. This was a nasty bottleneck on computers with highly concurrent hardware; flushing was inherently single threaded. I previously described the problem here.

But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing. No more bottleneck!

Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while using only 6 threads and 512 MB RAM buffer during indexing. Those settings resulted in the fastest indexing rate before concurrent flushing.

The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.

One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure steady state indexing rate, while indexing an effectively infinite content source, and ignoring the startup and ending transients; I suspect if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.

There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer does a great job describing these changes here and here. Concurrency is tricky!

Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your IO system can't keep up then it will bottleneck your CPU concurrency. The nightly benchmark runs on a computer with 12 real (24 with hyperthreading) cores and a fast (OCZ Vertex 3) solid-state disk. Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.
Published at DZone with permission of Michael Mccandless, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)