Comparison of the Grid/Cloud Computing Frameworks (Hadoop, GridGain, Hazelcast, DAC) - Part I
Some time ago we faced a typical programming challenge: how to perform an enormous amount of computations as fast as possible? The answer is simple: divide the problem into smaller ones, compute them in parallel and gather results. The overall conception isn't complicated, so it can be achieved on a variety of ways.
One option is to do it by yourself. Simply set up an HTTP server to farm out requests RESTfully, and accept responses the same way. You can also use AMQP/JMS, RMI or even the old-school Corba - it depends only on your skills and imagination.
However, you do not have to reinvent the wheel. There are several open-source frameworks, which can be used almost out of the box. Simply download the library, adjust your sources and observe how fast your problems are solved. But which framework should you use? The answer isn't simple and it highly depends on your individual needs. We cannot state, that for example Hadoop will be always the best choice, but we can point out the advantages and disadvantages of several frameworks, so you will be able to choose the best one for you by yourself.
We have decided to give a try the following frameworks:
We believe, that the serious study should be as transparent as possible. This is why we describe all the methods and results and give you access to all the sources. You should be able to repeat all the tests and receive the similar results. If not, please contact us, so we can revise our report.
You can find all sources used during tests in our code repository. Moreover, full report with detailed results is available on our website.
This is part I of our comparison, where we concentrate on the task distribution time. There were no node failures nor transactions rollbacks.
Test environment
Our test environment consisted with 5 machines (named intel1 - intel5), each one with 8 cpu on board, which gave us 40 processing units. You can see the architecture of the test environment on the following figure:
Methodology
We based our benchmark on the mathematical problem known as 'counting monotone boolean functions' (also known as Dedekind's problem). Why such unrealistic problem? Because:
- it is highly cpu-consuming (one cpu will need more than 800 hours to count all monotone boolean functions for N=8)
- you do not need to solve the whole problem in a benchmark (we chose a fragment that a single cpu will compute in more than 3 hours)
- it can be easily divided into an arbitrary number of tasks
- generated tasks have different computational needs (it is a perfect use case for load balancers)
With a such flexible problem in hands, we had decided to prepare three test scenarios:
- compute problem divided into 33700 tasks (CMBF with arguments: n = 4, level = 1000)
- compute problem divided into 2705 tasks (CMBF with arguments: n = 4, level = 10000)
- compute problem divided into 341 tasks (CMBF with arguments: n = 4, level = 100000)
All tests were repeated ten times in order to avoid measuring error.
Results - overview
You can find the average results on the following figure:
- X-axis: number of tasks the problem was divided to
- Y-axis: average time of the algorithm (in milliseconds)
| Library name | Tasks: 341 | Tasks: 2705 | Tasks: 33700 |
|---|---|---|---|
| DAC 0.7.1 | 305 834.70 | 299 507.70 | 304 971.93 |
| GridGain 2.1.1 | 372 279.70 | 338 310.40 | 350 744.00 |
| Hazelcast 1.8 | 348 716.70 | 321 922.70 | 335 363.30 |
| DAC 0.9.1 | 306 076.20 | 299 815.70 | 305 303.30 |
| Hadoop 0.20.1 | 467 042.70 | 384 331.60 | 365 660.40 |
As you can see, all frameworks obtained quite similar results. However, these are the best cases only (we performed two versions of the test for GridGain and Hadoop frameworks). Moreover, the best results were obtained in the middle test case (2705 tasks) with the exception of Hadoop, which gained the best time when there were 33700 tasks.
You will find the detailed methodology (sources, test environment description) and results (all performed test cases with std deviation and average values) on our website.
Summary
The above part I concentrates on the task distribution time. Because of the test environment limitations (only 40 cpu units), all frameworks obtained quite similar results. However, Hadoop was distributing tasks 20%-30% slower than other frameworks, but Hadoop was designed to manipulate large data sets, so the above results are totally understandable.
In the upcoming part II we will concentrate on the fail-over capabilities of the selected frameworks.
- Login or register to post comments
- 19393 reads
- Printer-friendly version
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)




Comments
Manuel Jordan replied on Wed, 2010/02/10 - 11:52am
Hello Krystian
Thanks for this article, waiting soon the second part
and increible the high value for GridGain in your statistic graphic
Regards
-Manuel
Leandro de Oliveira replied on Wed, 2010/02/10 - 1:24pm
Jakub Sławiński replied on Wed, 2010/02/10 - 4:11pm
in response to: lehphyro
DAC test was based on the usage_of_the_CMBF sample from the DAC distribution. You can grab it from the download page.
However, in order to be consistent, I have also added it to the http://dacframe.org/lab, so now all java sources used in the tests can be downloaded from the lab repository. Sorry for misleading you.
Nikita Ivanov replied on Thu, 2010/02/11 - 1:43pm
Leaving aside validity of these tests I'm willing to help you to fix your GridGain test. It is, however, strange that you decided to leave these rather obviously bogus results in...
Drop me email at nivanov at gridgain dot com and I'll have someone to help you to configure and use GridGain for this test.
Best,
Nikita Ivanov.
GridGain - Cloud Development Platform
Nikita Ivanov replied on Thu, 2010/02/11 - 2:34pm
Just spent 15 mins and rewrote your test in simplified form using Executor service (instead polluting 5 nodes with excessive split). Got results totally in line with the rest of the frameworks. Case's closed.
Guys - next time you do something like that please contact the vendor. We even provide free support for situations like yours :) Amount of features and configurations in GridGain is fairly substantial comparing to other frameworks and despite our best attempts they are not always obvious or easily understood. In your case - you are hitting a known edge-case problem with "siblings" explosion for which there's a well-known solution and which we'll optimize even further in the next release.
Best,
Nikita Ivanov.
GridGain - Cloud Development Platform
Jeff Darcy replied on Fri, 2010/02/12 - 9:09am
William Louth replied on Fri, 2010/02/12 - 9:35am
in response to: nivanov
William Louth replied on Fri, 2010/02/12 - 9:39am
Krystian Lider replied on Fri, 2010/02/12 - 11:06am
in response to: nivanov
I have updated the article and the content on the website with the results for GridGain using Executor service.
Krystian Lider replied on Fri, 2010/02/12 - 11:30am
in response to: wl88194
This test is an opportunity for us to gather feedback from the community and prepare better before something bigger.
At the end we would like to test different frameworks using collected suggestions on the environment with 1024 nodes.
Nikita Ivanov replied on Fri, 2010/02/12 - 12:04pm
Dmitriy Setrakyan replied on Fri, 2010/02/12 - 3:57pm
Kristian,
Next time when you decide to publish performance benchmarks so obviously biased towards DAC, please disclose your affiliation with DAC project. It looks like you are the owner of that project, or if not the owner, then a major committer. I just checked DAC bug-tracking system and your name is on over 90% of the bugs filed there: http://www.dacframe.org/trac/dac/report/6 .
I am sure that if I was to conduct these benchmarks, then I could structure it that GridGain comes up on top, then there would be Hazelcast, and then Hadoop. I would not even have included DAC as I have never heard of it until now, and it looks like its a very early stage project according to its website.
Best,
Dmitriy Setrakyan
GridGain Team - www.gridgain.com
Manuel Jordan replied on Fri, 2010/02/12 - 7:48pm
For GridGain team
Hello Guys, just some doubts
1) some plan to write a book?, It would be great really
2) Dmitri, I have read in your blog GridGain Vs. Hadoop (Continued)
the date is old "May 7, 2008", perhaps GridGain would do a new comparison?
Best Regards
-Manuel
Dmitriy Setrakyan replied on Fri, 2010/02/12 - 8:38pm
in response to: dr_pompeii
Hi Manuel,
1. We have been thinking about a book for a while. So maybe some day :)
2. As far as Hadoop comparision - we are in the midst of releasing our 3.0 version with bunch of new features including auto-scaling on the cloud and partitioned data grid. Perhaps, I will write another comparison after the release.
Stay tuned!
Best
Dmitriy, GridGain Team - www.gridgain.com
Manuel Jordan replied on Sat, 2010/02/13 - 11:50am
in response to: dsetrakyan
Thanks for the reply
I hope see the book with a good support integration for Spring and Hibernate
I will waiting carefully such comparison, let know us via Dzone to be stay tuned!
I want ask for a favor, I have seen the documentation section in GridGain,
would be possible give such information in offline approach? (pdf), because read from
the monitor kill the eyes and we have not option to write our own anotations/observation
in some place, like a notebook
Best Regards
-Manuel
John Catherino replied on Sun, 2010/02/21 - 6:20pm
Hi Krystian,
if you have about 10 minutes, you could try the cajo project; but I warn you: you might become forever spoilt over the extreme simplicity, and high performance. ;-)
Regards,
John
PS it also compiles and runs perfectly as 64-bit, then the performance is even more surprising!
Robert Ragno replied on Wed, 2010/02/24 - 12:01pm
I could easily be missing something here. If not...
Here is my problem with the test: This happens to be a weak implementation of the algorithm in question.
That isn't always a big deal, but it is definitely one when examining optimized performance. It just does not necessarily show the important and relative distinctions between the systems when not using code similar to what would be used in production or experimentation.
To highlight this, let me give a system that dominates every one here: A single light, old machine with a tweaked version of the code.
This is a machine that is years old and far weaker than any of the nodes in the experiment here. 2 CPUs, 2 cores each; 1GB of memory.
n = 4 level = 1000: 195.6 seconds
n = 4 level = 10000: 168.1 seconds
n = 4 level = 100000: 165.4 seconds
On a somewhat more modern machine - 4 CPUs (quad-core) - although still a year old, and only using 100MB of memory, I get:
n = 4 level = 1000: 73.0 seconds
n = 4 level = 10000: 64.0 seconds
n = 4 level = 100000: 62.8 seconds
I'm a little afraid you may have described the test inaccurately, or I may have misunderstood it. But, this is the packaged code run with the specified parameters. On a single machine, it is almost 5x faster than the whole cluster.
Am I just confused on something?
I'm not saying the distributed systems couldn't also be made faster. However, you really want to be benchmarking on something realistic. Otherwise, your comparisons may be skewed or disproportionate compared to real workloads.
Krystian Lider replied on Thu, 2010/02/25 - 4:13am
in response to: cajo
Hi John
The reason why DAC is so efficient in task distributing is that we use cajo internally :)
(as one of the possible transports )
Regards,
Krystian
Krystian Lider replied on Thu, 2010/02/25 - 4:49am
in response to: rragno
Your results look a little bit strange :)
As you can see at Grid Comparison - cumulative cost
counting CMBF should take around 11 888 seconds on a single CPU.
Could you please send me your version of the code?
Krystian Lider replied on Thu, 2010/02/25 - 11:02am
Hi Robert
Your version is very well optimised :)
After running it on our environment we have received following results:
n = 4 level = 1000: 26.1 seconds
n = 4 level = 10000: 20.7 seconds
n = 4 level = 100000: 20.2 seconds
so only three times faster than in your second test.
This means that our CPU cores are about three times slower than yours.
It looks like we did not sufficiently stress the fact that our main goal was to compare tasks distribution time of various frameworks.
From this perspective the only relevant result of this test is the difference from single threaded version of CMBF
I have a challenge for you: are you able to count CMBF for N = 9 ?
;)
Robert Ragno replied on Thu, 2010/02/25 - 4:39pm
in response to: klider
Hmm, I am a little confused, still. What is the 20 seconds for? Is that running on a single node? If so, it would say *your* machine is 3 times faster. Is it modified to be distributed by one of the systems?
I doubt that either of the machines I tested with is actually 3 times as fast. They aren't that great. However, various aspects of the context can make a big difference - the JRE version, an optimized OS, the options used when starting the JVM. What is your precise setup, in terms of these and the processor model?
In any case, I think I do get your intent, but I would still push my assertions. It remains true that an optimized version of the code provides much faster results using half a node than the original code does with the best distributed system across the cluster of five nodes. So, I think it is important to note that jumping to any distributed system early on is often not an efficient way to speed up high performance computing - you really want to optimize your processes *first*. You may even change your mind about how and what to distribute.
(It also is relevant to ask, "How does this compare to spending resources to optimize the process?" That isn't free, and it is often the tradeoff under consideration. There definitely are tasks where optimizing is of little use - such as web crawling - but these are not compute-intensive.)
But, more core to your task, using a realistic test case is important. It can change the relative overhead and the comparisons of the systems. There is a reason that you didn't just spin in a sleep loop as your test; you want something that reflects real workloads.
Now, the N=9 is interesting... That would likely take quite a while. Might be worth even optimizing further...
Krystian Lider replied on Fri, 2010/02/26 - 7:53am
in response to: rragno
Below I will try to answer for all your questions.
This 20 seconds was achieved on our cluster using DAC with your optimized code.
We used Sun's JVM with default settings.
After your comment we have updated test environment description.
Yes, you are absolutely right.
Do you have any specific realistic test case in mind? It will be great to hear new opinion.
Please notice that if you manage to do this, you will be the first who achieved that. Please check first optimizations on CMBF for N=8. In order to do that you have to use 6 instead of 4 in the application arguments. (Only for your information, counting CMBF for N=7 on single PC takes less than 0.5 second.)
Thank you for your valuable comments!
Regards
Krystian
Robert Ragno replied on Fri, 2010/02/26 - 1:38pm
in response to: klider
Ah. So, assuming DAC is perfectly efficient, you measured 100 machine-seconds in total. That is squarely between the two machines I reported, which makes some sense. (Especially since my slow one was 4 core, and the fast one was 16-core.) What do you see on a single one of those nodes? Also, are you running 64-bit?
I wouldn't use the default settings for Java. They won't be ideal for high-performance tasks. I would also make sure you are using 1.6.0_18. The options do matter - setting the memory can be very important, and compressed-oops can be substantial. And, of course, using the 64-bit JVM will help here. Using the server mode is critical, too, although you may be defaulting to that.
I wasn't clear when I said "using a realistic test case is important". I was asserting that that is what you are doing here (as opposed to something very synthetic). The only thing I was adding was that you want to test not just a meaningful algorithm but an optimized version of it, because that is what you would use in practice.
I do wish there were more real test cases around, but people don't seem to share them much... And Hadoop at least has never had much focus on benchmarking...
I like the straightforward nature of this experiment. I think it makes a clear statement that in the case of an easily-parallelized, low-data, CPU-intensive task, the distribution framework should not get in the way - performance should be close to the ideal. I think it is a good balance for the common tasks like terasort, which in reality are all about shufflinf data around.
I was understating it a little when I said you might want to optimize more for N=8. I am confused about your numbers, though. For 9, I would use 8? And 7 takes 0.5 seconds?
Thanks for adding the interesting detail to the writeup...
Krystian Lider replied on Mon, 2010/03/01 - 5:39am
in response to: rragno
We used 32-bit JVM because of the hardware architecture and the version was 1.6.0_13, although next time we will use 1.6.0_18.
You can find more information about Dedekind's problem at Mathpages.
When we were talking about N=7,8,9 we meant M(7), M(8), M(9).
Unfortunately we were also talking about parameter n which is only implementation detail to divide the problem to subtasks.
Regards,
Krystian
Robert Ragno replied on Mon, 2010/03/01 - 3:44pm
in response to: klider
The E5410 certainly supports x64. You should definitely be using a 64-bit OS on these machines for any performance testing. With the compressed-oops support, I would also propose that the 64-bit JVM is best for most performance testing, although you could try both for a task. Even if you used a 32-bit JVM, it should still be on a 64-bit OS - I wouldn't support a 32-bit OS on a server machine for many years.
The current JRE is important because of improvements, although it is more critical with a Nehalem machine.
I thought your 'level' parameter controlled the division into subtasks. You are saying that is actually the 'n' parameter? That doesn't seem consistent with the output. Then how do you configure the program for M(7), M(8), M(9)?
Krystian Lider replied on Mon, 2010/03/08 - 9:39am
in response to: rragno
Both parameters control the division into subtasks:
- n - indicates part of the problem to compute
- level - estimated subtask size
Unfortunately this program can only compute M(8) without modifications, its algorithm is taken from the following paper: Algorithms counting monotone Boolean functions
Krystian Lider replied on Mon, 2010/03/22 - 4:20am