
The Problem with Hadoop in HPC

05.31.2014

When it comes to handling big data, Hadoop is a major player, but it doesn't seem to have much traction in the high-performance computing community. In a thoughtful and detailed blog post, high-end computing enthusiast Glenn K. Lockwood dissects this disparity, attributing much of it to Hadoop's commercial origins and its intended use in non-scientific communities.

I think what makes Hadoop uncomfortable to the HPC community is that, unlike virtually every other technology that has found successful adoption within research computing, Hadoop was not designed by HPC people ... By contrast, Hadoop was developed by Yahoo, and the original MapReduce was developed by Google.  They were not created to solve problems in fundamental science or national defense; they were created to provide a service for the masses.

Hadoop is also written in Java, a decision that made sense in the context of commercial applications and web services but that clashes with the supercomputing world. For a community defined by the descriptor "high performance," Java carries the opposite reputation: it is widely perceived as slow and inefficient.

The idea of running Java applications on supercomputers is beginning to look less funny nowadays with the explosion of cheap genome sequencing... With that being said though, Java is still a very strange way to interact with a supercomputer.  Java applications don't compile, look, or feel like normal applications in UNIX as a result of their cross-platform compatibility... For the vast majority of HPC users coming from traditional domain sciences and the professionals who support their infrastructure, Java applications remain unconventional and foreign.
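To give a sense of what that looks like in practice, here is a minimal sketch of the kind of Java code a typical Hadoop job involves: a generic word-count mapper written against the standard org.apache.hadoop.mapreduce API. The class and variable names are illustrative only and are not taken from Lockwood's post.

// Minimal sketch of a Hadoop mapper, assuming the standard
// org.apache.hadoop.mapreduce API. Names are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit (word, 1) pairs;
        // the framework shuffles them to a separate reducer for summing.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

Rather than compiling to a native binary and launching it through a batch scheduler and MPI, the user packages such classes into a JAR and submits them with something like "hadoop jar wordcount.jar ...", which is part of what makes the workflow feel foreign to traditional HPC users.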

Lockwood also points out that Hadoop reinvents the wheel, re-implementing functionality that has existed within high-performance computing for decades, in ways that frustrate supercomputing professionals.

...these poor reinventions are not the result of ignorance; rather, Hadoop's reinvention of a lot of HPC technologies arises from reason #1 above: Hadoop was not designed to run on supercomputers and it was not designed to fit into the existing matrix of technologies available to traditional HPC.  Rather, it was created to interoperate with web-oriented infrastructure.

The way Hadoop has evolved also runs counter to how technologies usually develop within high-performance computing, which may be another source of frustration: it arrived as an answer to a question that HPC had not yet asked.

The evolution of Hadoop has very much been a backwards one; it entered HPC as a solution to a problem which, by and large, did not yet exist.  As a result, it followed a common, but backwards, pattern by which computer scientists, not domain scientists, get excited by a new toy and invest a lot of effort into creating proof-of-concept codes and use cases.  Unfortunately, this sort of development is fundamentally unsustainable because of its nucleation in a vacuum, and in the case of Hadoop, researchers moved on to the next big thing and largely abandoned their model applications as the shine of Hadoop faded.

However, there are ways to help Hadoop fit more snugly into high-performance computing, which include, but aren't limited to, adapting MapReduce to be more performance-oriented and implementing established HPC technologies within Hadoop itself. Ultimately, it's about overcoming bias against Hadoop's origins and working out the kinks. While not front-and-center in supercomputing, Hadoop shouldn't be dismissed either; it could still help solve specific problems, even if, as Lockwood notes, those problems have largely yet to materialize in HPC.

I think I have a pretty good idea about why Hadoop has received a lukewarm, and sometimes cold, reception in HPC circles, and much of these underlying reasons are wholly justified.  Hadoop's from the wrong side of the tracks from the purists' perspective, and it's not really changing the way the world will do its high-performance computing.  There is a disproportionate amount of hype surrounding it as a result of its revolutionary successes in the commercial data sector.

For more information, read Lockwood's original post here.

Published at DZone with permission of its author, Whitney Baker.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)