Big Data/BI Zone is brought to you in partnership with:

Ayende Rahien is working for Hibernating Rhinos LTD, a Israeli based company producing developer productivity tools for OLTP applications such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler(l2sprof.com), Entity Framework Profiler (efprof.com) and more. Ayende is a DZone MVB and is not an employee of DZone and has posted 434 posts at DZone. You can read more from them at their website. View Full User Profile

Big Data Search, Part 5: Sorting Optimizations

01.27.2014
| 5228 views |
  • submit to reddit

[Editor's note: This is part 5 of a series. Check out part 1 , part 2, part 3, and part 4 first]

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here.

In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort. Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file.

Well, one of the reason for this performance is that we execute this in a serial fashion, that is, one after another. But we have to completely isolated indexes, there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a buffer while we are adding to the buffer.

We wrote to intermediary files, but we wrote to those using plain file I/O. But it is usually a lot more costly to write to disk than to compress and then write to disk.  We are writing sorted data, so it is probably going to compress pretty well.

Those are the things that pop to mind. Can you think of additional options?


Published at DZone with permission of Ayende Rahien, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)