Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.

Near-real-time latency during large merges

06.19.2011
I looked into the curious issue I described in my last post, where the NRT reopen delays can become "spikey" (take longer) during a large merge.

To show the issue, I modified the NRT test to kick off a background optimize on startup. This runs a single large merge, creating a 13 GB segment, and indeed produces spikey reopen delays (purple):



The large merge finishes shortly after 7 minutes, after which the reopen delays become healthy again. Search performance (green) is unaffected.

I also added Linux's dirty bytes to the graph, as reported by /proc/meminfo; it's the saw-tooth blue/green series on the bottom. Note that it's divided by 10 to better fit the Y axis; the peaks are around 800-900 MB.

The large merge writes bytes at a fairly high rate (around 30 MB/sec), but Linux buffers those writes in RAM, only actually flushing them to disk every 30 seconds; this is what produces the saw-tooth pattern.

From the graph you can see that the spikey reopen delays generally correlate to when Linux is flushing the dirty pages to disk. Apparently, this heavy write IO interferes with the read IO required when resolving deleted terms to document IDs. To confirm this, I ran the same stress test, but with only adds (no deletions); the reopen delays were then unaffected by the ongoing large merge.
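To make the adds-vs-deletes distinction concrete, here is a minimal sketch (not the actual benchmark code; the class and method names are made up for illustration) of the two indexing paths. updateDocument buffers a delete-by-term, and those buffered deletes are resolved to docIDs against the on-disk terms dictionary when the NRT reader is reopened; addDocument needs no such lookup.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class IndexOneDoc {
      // Sketch of the two indexing paths exercised by the stress test.
      static void indexOne(IndexWriter writer, String id, Document doc,
                           boolean withDeletes) throws Exception {
        if (withDeletes) {
          // Replaces any existing document with the same primary key; resolving
          // the buffered delete term to a docID reads the terms dictionary at
          // NRT reopen time, which competes with the merge's heavy write IO.
          writer.updateDocument(new Term("id", id), doc);
        } else {
          // Pure add: no term lookup, so reopen latency stays flat even while
          // a large merge is writing.
          writer.addDocument(doc);
        }
      }
    }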

So finally the mystery is explained, but how to fix it?

I know I could tune Linux's IO, for example to write more frequently, but I'd rather find a Lucene-only solution since we can't expect most users to tune the OS.

One possibility is to make a RAM resident terms dictionary, just for primary-key fields. This could be very compact, for example by using an FST, and should give lookups that never hit disk unless the OS has frustratingly swapped out your RAM data structures. This can also be separately useful for applications that need fast document lookup by primary key, so someone should at some point build this.
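As a rough illustration of that idea, here is a minimal sketch, assuming a recent Lucene API (MultiTerms, PostingsEnum; the class names were different when this was written) and using a plain HashMap where a real implementation would use a compact per-segment FST:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiTerms;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    class PrimaryKeyCache {
      private final Map<BytesRef, Integer> pkToDocID = new HashMap<>();

      // Load every primary-key term and its single docID into RAM once, so
      // later lookups never touch the on-disk terms dictionary.
      PrimaryKeyCache(IndexReader reader, String pkField) throws Exception {
        Terms terms = MultiTerms.getTerms(reader, pkField);
        if (terms == null) {
          return;
        }
        TermsEnum termsEnum = terms.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
          int docID = postings.nextDoc();  // a primary key matches one doc
          pkToDocID.put(BytesRef.deepCopyOf(term), docID);
        }
      }

      Integer lookup(BytesRef pk) {
        return pkToDocID.get(pk);
      }
    }

A real version would be built per segment and refreshed on reopen (docIDs are not stable across reopens), but it shows how delete-term resolution could stay entirely in RAM.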

Another, lower-level idea is to simply rate limit the bytes/sec written by merges. Since big merges also impact ongoing searches, this could likely help that case as well. To try it out, I made a simple prototype (see LUCENE-3202), and then re-ran the same stress test, limiting all merging to 10 MB/sec:



The optimize now took three times longer, and the peak dirty bytes (around 300 MB) are about one third as large, as expected since the IO write rate is limited to 10 MB/sec. But look at the reopen delays: they are now much better contained, averaging around 70 milliseconds while the optimize is running and dropping to 60 milliseconds once it finishes. I think the ability to limit merging IO is an important feature for Lucene!
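To get a feel for what such a limit involves, here is a minimal, hypothetical sketch (not the actual LUCENE-3202 prototype): the merge path would call pause after each chunk of bytes it writes, and the limiter sleeps whenever writing has gotten ahead of the configured MB/sec.

    class SimpleRateLimiter {
      private final double bytesPerNanosecond;
      private long lastTime = System.nanoTime();

      SimpleRateLimiter(double mbPerSec) {
        this.bytesPerNanosecond = mbPerSec * 1024 * 1024 / 1_000_000_000.0;
      }

      // Call after writing `bytes`; sleeps if the writer is running ahead of
      // the configured long-term rate.
      synchronized void pause(long bytes) throws InterruptedException {
        long targetTime = lastTime + (long) (bytes / bytesPerNanosecond);
        long now = System.nanoTime();
        if (targetTime > now) {
          Thread.sleep((targetTime - now) / 1_000_000);
        }
        lastTime = Math.max(now, targetTime);
      }
    }

At 10 MB/sec, for example, a merge thread that just wrote a 1 MB chunk would be held back roughly 100 milliseconds before its next write.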
Published at DZone with permission of Michael McCandless, author and DZone MVB.
