
Krystian Lider is co-owner of SolDevelo, a software development company with a strong focus on software quality.

Comparison of the Grid/Cloud Computing Frameworks (Hadoop, GridGain, Hazelcast, DAC) - Part I

02.10.2010

Some time ago we faced a typical programming challenge: how do you perform an enormous amount of computation as fast as possible? The answer is simple: divide the problem into smaller pieces, compute them in parallel, and gather the results. The overall concept isn't complicated, and it can be achieved in a variety of ways.

One option is to do it yourself. Simply set up an HTTP server to farm out requests RESTfully and accept responses the same way. You could also use AMQP/JMS, RMI, or even old-school CORBA - it depends only on your skills and imagination.
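In its simplest single-JVM form, the divide/compute/gather pattern can be sketched with the standard ExecutorService API (a minimal sketch only; the chunk count and the compute method are hypothetical placeholders for your own problem):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ScatterGather {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            // Divide: one Callable per independent chunk of the problem.
            List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
            for (int chunk = 0; chunk < 100; chunk++) {
                final int c = chunk;
                tasks.add(new Callable<Long>() {
                    public Long call() {
                        return compute(c);
                    }
                });
            }

            // Compute in parallel, then gather the partial results.
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            System.out.println("result = " + total);
            pool.shutdown();
        }

        // Hypothetical placeholder for the real per-chunk computation.
        static long compute(int chunk) {
            return chunk;
        }
    }

The frameworks below essentially replace the thread pool with a cluster: the dividing and gathering stay, but the tasks travel to other machines.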

However, you do not have to reinvent the wheel. There are several open-source frameworks that can be used almost out of the box. Simply download the library, adjust your sources, and observe how fast your problems are solved. But which framework should you use? The answer isn't simple, and it depends heavily on your individual needs. We cannot state that, for example, Hadoop will always be the best choice, but we can point out the advantages and disadvantages of several frameworks so you can choose the best one for yourself.

We decided to give the following frameworks a try: Hadoop, GridGain, Hazelcast, and DAC.

We believe that a serious study should be as transparent as possible. That is why we describe all the methods and results and give you access to all the sources. You should be able to repeat all the tests and obtain similar results. If not, please contact us so we can revise our report.

You can find all the sources used during the tests in our code repository. Moreover, the full report with detailed results is available on our website.

This is Part I of our comparison, in which we concentrate on task distribution time. There were no node failures or transaction rollbacks.

Test environment

Our test environment consisted of 5 machines (named intel1 - intel5), each with 8 CPUs on board, which gave us 40 processing units. You can see the architecture of the test environment in the following figure:

Methodology

We based our benchmark on the mathematical problem known as 'counting monotone Boolean functions' (CMBF), also known as Dedekind's problem; a brute-force sketch of it follows the list below. Why such an unrealistic problem? Because:

  • it is highly CPU-consuming (one CPU would need more than 800 hours to count all monotone Boolean functions for N=8)
  • you do not need to solve the whole problem in a benchmark (we chose a fragment that a single CPU computes in more than 3 hours)
  • it can easily be divided into an arbitrary number of tasks
  • the generated tasks have different computational needs (a perfect use case for load balancers)
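To make the problem concrete, the sketch below counts monotone Boolean functions by brute force. This is an illustration only, not the optimized algorithm used in the benchmark: enumerating all 2^(2^N) candidate functions is feasible only up to about N = 4 (where M(4) = 168), which is exactly why the real computation has to be divided and distributed.

    public class DedekindBruteForce {
        public static void main(String[] args) {
            for (int n = 0; n <= 4; n++) {
                System.out.println("M(" + n + ") = " + count(n));
            }
        }

        // Enumerate every Boolean function on n variables as a bitmask of
        // its truth table and keep the monotone ones.
        static long count(int n) {
            int points = 1 << n;           // inputs x in {0,1}^n
            long functions = 1L << points; // 2^(2^n) candidate functions
            long monotone = 0;
            for (long f = 0; f < functions; f++) {
                if (isMonotone(f, points, n)) {
                    monotone++;
                }
            }
            return monotone;
        }

        // f is monotone iff raising any input bit of x never drops f(x) from 1 to 0.
        static boolean isMonotone(long f, int points, int n) {
            for (int x = 0; x < points; x++) {
                for (int bit = 0; bit < n; bit++) {
                    if ((x & (1 << bit)) == 0) {
                        int y = x | (1 << bit); // y >= x pointwise
                        if (((f >> x) & 1) == 1 && ((f >> y) & 1) == 0) {
                            return false;
                        }
                    }
                }
            }
            return true;
        }
    }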


With such a flexible problem in hand, we decided to prepare three test scenarios:

  • compute the problem divided into 33700 tasks (CMBF with arguments: n = 4, level = 1000)
  • compute the problem divided into 2705 tasks (CMBF with arguments: n = 4, level = 10000)
  • compute the problem divided into 341 tasks (CMBF with arguments: n = 4, level = 100000)


All tests were repeated ten times in order to reduce measurement error.

Results - overview

You can find the average results in the following figure and table:

  • X-axis: number of tasks the problem was divided into
  • Y-axis: average running time of the algorithm (in milliseconds)


Library name      Tasks: 341    Tasks: 2705   Tasks: 33700
DAC 0.7.1         305 834.70    299 507.70    304 971.93
GridGain 2.1.1    372 279.70    338 310.40    350 744.00
Hazelcast 1.8     348 716.70    321 922.70    335 363.30
DAC 0.9.1         306 076.20    299 815.70    305 303.30
Hadoop 0.20.1     467 042.70    384 331.60    365 660.40

As you can see, all frameworks obtained quite similar results. Note, however, that these are the best cases only (we performed two versions of the test for the GridGain and Hadoop frameworks). Moreover, the best results were obtained in the middle test case (2705 tasks), with the exception of Hadoop, which achieved its best time with 33700 tasks.
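As a rough sanity check on the absolute numbers: counting the chosen CMBF fragment takes around 11 888 seconds on a single CPU (see the comments below), so an ideal 40-way parallelization would finish in about 11 888 / 40 ≈ 297 seconds, i.e. roughly 297 000 ms. DAC's best average of 299 507.70 ms is therefore within about 1% of the ideal, while Hadoop's best of 365 660.40 ms corresponds to roughly 23% of distribution overhead.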

You will find the detailed methodology (sources, test environment description) and results (all performed test cases with standard deviations and average values) on our website.

Summary

Part I above concentrates on task distribution time. Because of the test environment's limitations (only 40 CPU units), all frameworks obtained quite similar results. Hadoop distributed tasks 20%-30% slower than the other frameworks, but Hadoop was designed to manipulate large data sets, so this result is entirely understandable.

In the upcoming Part II, we will concentrate on the fail-over capabilities of the selected frameworks.

Published at DZone with permission of its author, Krystian Lider.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Manuel Jordan replied on Wed, 2010/02/10 - 11:52am

Hello Krystian

Thanks for this article - looking forward to the second part.

The high value for GridGain in your statistics graphic is incredible!

Regards,

-Manuel

Leandro de Oliveira replied on Wed, 2010/02/10 - 1:24pm

Where is the code of your DAC test?

Jakub Sławiński replied on Wed, 2010/02/10 - 4:11pm in response to: Leandro de Oliveira

The DAC test was based on the usage_of_the_CMBF sample from the DAC distribution. You can grab it from the download page.


However, in order to be consistent, I have also added it to http://dacframe.org/lab, so now all Java sources used in the tests can be downloaded from the lab repository. Sorry for misleading you.

Nikita Ivanov replied on Thu, 2010/02/11 - 1:43pm

To whoever did this test...

Leaving aside the validity of these tests, I'm willing to help you fix your GridGain test. It is, however, strange that you decided to leave these rather obviously bogus results in...

Drop me an email at nivanov at gridgain dot com and I'll have someone help you configure and use GridGain for this test.

Best,
Nikita Ivanov.
GridGain - Cloud Development Platform

Nikita Ivanov replied on Thu, 2010/02/11 - 2:34pm

Geez...

Just spent 15 mins and rewrote your test in simplified form using the Executor service (instead of polluting 5 nodes with an excessive split). Got results totally in line with the rest of the frameworks. Case closed.

Guys - next time you do something like this, please contact the vendor. We even provide free support for situations like yours :) The number of features and configurations in GridGain is fairly substantial compared to other frameworks, and despite our best attempts they are not always obvious or easily understood. In your case, you are hitting a known edge-case problem with "siblings" explosion, for which there is a well-known solution and which we will optimize even further in the next release.

Best,
Nikita Ivanov.
GridGain - Cloud Development Platform

Jeff Darcy replied on Fri, 2010/02/12 - 9:09am

It might also be worth checking out Sector/Sphere (http://sector.sourceforge.net/), which is very much like Hadoop/HDFS but claims to be much faster for some problems.

William Louth replied on Fri, 2010/02/12 - 9:35am in response to: Nikita Ivanov

Hi Nikita, is this not the issue I reported in Oct 2009 at a customer site that was performing a similar benchmark? If so, you might forgive them a little here, though I think they had a duty to contact you prior to publication given the severity of the issue. Kind regards, William

William Louth replied on Fri, 2010/02/12 - 9:39am

The closeness of the results indicates to me that the test itself is not measuring what we would like - the underlying trade-offs made in the implementation of each framework. In addition, the numbers would have much more meaning outside of this narrow case if CPU/network/memory utilization were reported as well.
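For what it's worth, a coarse sampler along the lines William suggests can be built from the standard java.lang.management APIs alone (network counters would require OS-specific tooling); a minimal sketch, meant to run in a background thread beside the benchmark:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class UtilizationSampler implements Runnable {
        public void run() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            Runtime rt = Runtime.getRuntime();
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    double load = os.getSystemLoadAverage(); // 1-minute system load
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
                    System.out.printf("load=%.2f heapUsed=%dMB%n", load, usedMb);
                    Thread.sleep(5000); // sample every 5 seconds
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // benchmark done; stop sampling
            }
        }
    }

    // Usage: new Thread(new UtilizationSampler()).start();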

Krystian Lider replied on Fri, 2010/02/12 - 11:06am in response to: Nikita Ivanov

I have updated the article and the content on the website with the results for GridGain using the Executor service.


Krystian Lider replied on Fri, 2010/02/12 - 11:30am in response to: William Louth

William, thank you for your valuable comments.

This test is an opportunity for us to gather feedback from the community and to prepare better for something bigger. In the end we would like to test the frameworks, using the collected suggestions, in an environment with 1024 nodes.

Nikita Ivanov replied on Fri, 2010/02/12 - 12:04pm

@William: yes, the problem is the same. It's an edge case, but nonetheless we'll be addressing it in 3.0. @Lider: thanks for re-running the test. I would also suggest showing the % delta between the results; even though the differences are negligible, the chart is not very informative (or put it on a log scale). Thanks, Nikita.

Dmitriy Setrakyan replied on Fri, 2010/02/12 - 3:57pm

Krystian,

Next time you decide to publish performance benchmarks so obviously biased towards DAC, please disclose your affiliation with the DAC project. It looks like you are the owner of that project, or if not the owner, then a major committer. I just checked the DAC bug-tracking system and your name is on over 90% of the bugs filed there: http://www.dacframe.org/trac/dac/report/6 .

I am sure that if I were to conduct these benchmarks, I could structure them so that GridGain comes out on top, then Hazelcast, and then Hadoop. I would not even have included DAC, as I had never heard of it until now, and it looks like a very early-stage project according to its website.

Best,
Dmitriy Setrakyan
GridGain Team - www.gridgain.com

Manuel Jordan replied on Fri, 2010/02/12 - 7:48pm

For the GridGain team:

Hello guys, just a couple of questions.

1) Any plans to write a book? It would be great, really.

2) Dmitriy, I have read your blog post GridGain Vs. Hadoop (Continued); its date is old ("May 7, 2008"). Perhaps GridGain could do a new comparison?

Best regards,

-Manuel

Dmitriy Setrakyan replied on Fri, 2010/02/12 - 8:38pm in response to: Manuel Jordan

Hi Manuel,

1. We have been thinking about a book for a while. So maybe some day :)
2. As far as the Hadoop comparison goes - we are in the midst of releasing our 3.0 version with a bunch of new features, including auto-scaling on the cloud and a partitioned data grid. Perhaps I will write another comparison after the release.

Stay tuned!

Best
Dmitriy, GridGain Team - www.gridgain.com

Manuel Jordan replied on Sat, 2010/02/13 - 11:50am in response to: Dmitriy Setrakyan

Hello Dmitriy

Thanks for the reply

I hope to see the book, with good integration support for Spring and Hibernate. I will be watching for that comparison - let us know via DZone so we can stay tuned!

I want to ask for a favor: I have seen the documentation section for GridGain. Would it be possible to provide that information offline as well (PDF)? Reading from the monitor kills the eyes, and we have no way to write our own annotations/observations somewhere, like in a notebook.

Best Regards

-Manuel

John Catherino replied on Sun, 2010/02/21 - 6:20pm

Hi Krystian,

If you have about 10 minutes, you could try the cajo project; but I warn you: you might become forever spoilt by its extreme simplicity and high performance. ;-)

Regards,

John

PS: it also compiles and runs perfectly as 64-bit, and then the performance is even more surprising!

Robert Ragno replied on Wed, 2010/02/24 - 12:01pm

I could easily be missing something here. If not...

Here is my problem with the test: This happens to be a weak implementation of the algorithm in question.

That isn't always a big deal, but it definitely is one when examining optimized performance. The test does not necessarily show the important relative distinctions between the systems when the code is not similar to what would be used in production or experimentation.

To highlight this, let me give a system that dominates every one here: A single light, old machine with a tweaked version of the code.

This is a machine that is years old and far weaker than any of the nodes in the experiment here. 2 CPUs, 2 cores each; 1GB of memory.

n = 4 level = 1000: 195.6 seconds
n = 4 level = 10000: 168.1 seconds
n = 4 level = 100000: 165.4 seconds

On a somewhat more modern machine - 4 CPUs (quad-core) - although still a year old, and only using 100MB of memory, I get:

n = 4 level = 1000: 73.0 seconds
n = 4 level = 10000: 64.0 seconds
n = 4 level = 100000: 62.8 seconds

I'm a little afraid you may have described the test inaccurately, or I may have misunderstood it. But, this is the packaged code run with the specified parameters. On a single machine, it is almost 5x faster than the whole cluster.

Am I just confused on something?

I'm not saying the distributed systems couldn't also be made faster. However, you really want to be benchmarking on something realistic. Otherwise, your comparisons may be skewed or disproportionate compared to real workloads.

Krystian Lider replied on Thu, 2010/02/25 - 4:13am in response to: John Catherino

Hi John

The reason DAC is so efficient at task distribution is that we use cajo internally :) (as one of the possible transports).

Regards,
Krystian

Krystian Lider replied on Thu, 2010/02/25 - 4:49am in response to: Robert Ragno

Hi Robert

Your results look a little bit strange :)
As you can see at Grid Comparison - cumulative cost, counting CMBF should take around 11 888 seconds on a single CPU.

Could you please send me your version of the code?

Krystian Lider replied on Thu, 2010/02/25 - 11:02am

Hi Robert

Your version is very well optimised :)

After running it in our environment we received the following results:

n = 4 level = 1000: 26.1 seconds
n = 4 level = 10000: 20.7 seconds
n = 4 level = 100000: 20.2 seconds

So it is only about three times faster than in your second test. This means that our CPU cores are about three times slower than yours.

It looks like we did not sufficiently stress the fact that our main goal was to compare the task distribution time of the various frameworks. From this perspective, the only relevant result of this test is the difference from the single-threaded version of CMBF.

[Figure: Difference with CMBF]

I have a challenge for you: are you able to count CMBF for N = 9?

;)

Robert Ragno replied on Thu, 2010/02/25 - 4:39pm in response to: Krystian Lider

Hmm, I am a little confused, still. What is the 20 seconds for? Is that running on a single node? If so, it would say *your* machine is 3 times faster. Is it modified to be distributed by one of the systems?

I doubt that either of the machines I tested with is actually 3 times as fast. They aren't that great. However, various aspects of the context can make a big difference - the JRE version, an optimized OS, the options used when starting the JVM. What is your precise setup, in terms of these and the processor model?

In any case, I think I do get your intent, but I would still push my assertions. It remains true that an optimized version of the code provides much faster results using half a node than the original code does with the best distributed system across the cluster of five nodes. So, I think it is important to note that jumping to any distributed system early on is often not an efficient way to speed up high-performance computing - you really want to optimize your processes *first*. You may even change your mind about how and what to distribute.

(It is also relevant to ask, "How does this compare to spending resources to optimize the process?" That isn't free, and it is often the tradeoff under consideration. There definitely are tasks where optimizing is of little use - such as web crawling - but these are not compute-intensive.)

But, more core to your task, using a realistic test case is important. It can change the relative overhead and the comparisons of the systems. There is a reason that you didn't just spin in a sleep loop as your test; you want something that reflects real workloads.

Now, the N=9 is interesting... That would likely take quite a while. Might be worth optimizing even further...

Krystian Lider replied on Fri, 2010/02/26 - 7:53am in response to: Robert Ragno


Below I will try to answer all of your questions.

"Hmm, I am a little confused, still. What is the 20 seconds for? Is that running on a single node? [...]"

These 20 seconds were achieved on our cluster using DAC with your optimized code.

"I doubt that either of the machines I tested with is actually 3 times as fast. [...] What is your precise setup, in terms of these and the processor model?"

We used Sun's JVM with default settings. After your comment we have updated the test environment description.

"In any case, I think I do get your intent, but I would still push my assertions. [...] you really want to optimize your processes *first*."

Yes, you are absolutely right.

"But, more core to your task, using a realistic test case is important. [...]"

Do you have any specific realistic test case in mind? It would be great to hear a new opinion.

"Now, the N=9 is interesting... That would likely take quite a while. Might be worth optimizing even further..."

Please notice that if you manage to do this, you will be the first to achieve it. First check your optimizations on CMBF for N=8; in order to do that, you have to use 6 instead of 4 in the application arguments. (Just for your information: counting CMBF for N=7 on a single PC takes less than 0.5 seconds.)

Thank you for your valuable comments!

Regards,
Krystian

Robert Ragno replied on Fri, 2010/02/26 - 1:38pm in response to: Krystian Lider

Ah. So, assuming DAC is perfectly efficient, you measured 100 machine-seconds in total. That is squarely between the two machines I reported, which makes some sense. (Especially since my slow one was 4-core, and the fast one was 16-core.) What do you see on a single one of those nodes? Also, are you running 64-bit?

I wouldn't use the default settings for Java. They won't be ideal for high-performance tasks. I would also make sure you are using 1.6.0_18. The options do matter - setting the memory can be very important, and compressed oops can make a substantial difference. And, of course, using the 64-bit JVM will help here. Using server mode is critical, too, although you may be defaulting to that.
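A concrete example of the kind of invocation Robert is suggesting - the flags are standard HotSpot options, but the heap sizes, jar name, and program arguments are purely illustrative, and -XX:+UseCompressedOops requires a 64-bit JVM (6u14 or later):

    java -server -Xms512m -Xmx2g -XX:+UseCompressedOops -jar cmbf-benchmark.jar 4 1000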

I wasn't clear when I said "using a realistic test case is important". I was asserting that that is what you are doing here (as opposed to something very synthetic). The only thing I was adding was that you want to test not just a meaningful algorithm but an optimized version of it, because that is what you would use in practice.

I do wish there were more real test cases around, but people don't seem to share them much... And Hadoop at least has never had much focus on benchmarking...

I like the straightforward nature of this experiment. I think it makes a clear statement that in the case of an easily-parallelized, low-data, CPU-intensive task, the distribution framework should not get in the way - performance should be close to the ideal. I think it is a good counterbalance to common tasks like terasort, which in reality are all about shuffling data around.

I was understating it a little when I said you might want to optimize more for N=8. I am confused about your numbers, though. For 9, I would use 8? And 7 takes 0.5 seconds?

Thanks for adding the interesting detail to the writeup...

Krystian Lider replied on Mon, 2010/03/01 - 5:39am in response to: Robert Ragno

We used a 32-bit JVM because of the hardware architecture, and the version was 1.6.0_13; next time we will use 1.6.0_18.

"I was understating it a little when I said you might want to optimize more for N=8. I am confused about your numbers, though. For 9, I would use 8? And 7 takes 0.5 seconds?"

You can find more information about Dedekind's problem at MathPages.

When we were talking about N = 7, 8, 9 we meant M(7), M(8), M(9). Unfortunately, we were also talking about the parameter n, which is only an implementation detail used to divide the problem into subtasks.

Regards,
Krystian

Robert Ragno replied on Mon, 2010/03/01 - 3:44pm in response to: Krystian Lider

The E5410 certainly supports x64. You should definitely be using a 64-bit OS on these machines for any performance testing. With compressed-oops support, I would also argue that the 64-bit JVM is best for most performance testing, although you could try both for a given task. Even if you used a 32-bit JVM, it should still be on a 64-bit OS - I wouldn't have used a 32-bit OS on a server machine for many years now.

The current JRE is important because of improvements, although it is more critical with a Nehalem machine.

I thought your 'level' parameter controlled the division into subtasks. You are saying that it is actually the 'n' parameter? That doesn't seem consistent with the output. Then how do you configure the program for M(7), M(8), M(9)?

Krystian Lider replied on Mon, 2010/03/08 - 9:39am in response to: Robert Ragno

Both parameters control the division into subtasks:

  • n - indicates which part of the problem to compute
  • level - estimated subtask size


Unfortunately, this program can only compute M(8) without modifications; its algorithm is taken from the following paper: Algorithms counting monotone Boolean functions.

Krystian Lider replied on Mon, 2010/03/22 - 4:20am

Part II is now available.
