
Nikita Ivanov is a founder and CEO of GridGain Systems, developer of one of the most innovative real time big data platforms in the world. He has almost 20 years of experience in software development, a vision and pragmatic view of where development technology is going, and high quality standards in software engineering and entrepreneurship. Nikita is a DZone MVB, is not an employee of DZone, and has posted 27 posts at DZone.

GigaOm: 'Hadoop's days are numbered…' Are they?

There is an interesting article at GigaOm. I won’t repeat the main points, but basically it says that since Hadoop is disk/ETL/batch based it is not a fit for real time processing of frequently changing data. The author correctly points out that real time processing (i.e. perceptual real time, meaning sub-second to a few seconds of response time) is becoming a HUGE trend that’s impossible to ignore. He points to Google, which moved away from a Hadoop MapReduce-like approach towards massively distributed in-memory platforms for various projects like Percolator and Dremel...

So, What’s New?!

The widespread confusion about Hadoop’s role and its applicability is becoming alarming... Hadoop was never designed to process anything in real time, to process live streaming data, or to process anything that’s rapidly changing. Hadoop’s core is HDFS - a highly scalable distributed file system that works on spinning disks and supports efficient batch storage of and access to data. It is an excellent data warehouse technology that scales to petabytes of data on commodity hardware, and Hadoop does an excellent job at this. The Hadoop ecosystem also has MapReduce (and various satellite projects like Pig, Hive, etc.).

Hadoop’s implementation of MapReduce (as well as Pig, Hive, HBase, etc.) “suffers” from exactly the same limitation - it works over HDFS and is therefore architecturally batch and disk oriented. Let me repeat it again - Hadoop MapReduce was never meant to process anything in real time or to work on live streaming data. Period, end of story. It was designed to work over datasets stored in disk-based HDFS - and it does so very well.

Are They Really Numbered?

I don’t see anything on the horizon that would displace Hadoop HDFS. There’s a clear business use case and demand for massive disk-based storage on petabyte/exabyte scale - and Hadoop HDFS is the clear industry choice today. Hadoop HDFS is here to stay for a long time... But as Gartner’s Merv Adrian says, Big Data has two sides to its coin: storage and processing. Hadoop HDFS provides excellent storage technology, but its processing side isn’t as shiny. As I (and many others) have mentioned, Hadoop MapReduce is bound to live by the limitations of HDFS - batch-oriented, offline, disk-based processing. Some companies will be content with those limitations (and for many that is just fine). Others will follow Facebook, Google and Twitter in moving away from disk-based, offline processing towards real time in-memory data platforms.

What is very important to understand is that the move to in-memory processing isn’t only about raw speed (although RAM access is up to 10,000,000 times faster than disk). What’s more important is that keeping your working set in memory enables a completely new family of algorithms that you can employ. Incremental indexing (Google’s Percolator, GridGain’s Data Grid), streaming MapReduce/CEP (GridGain’s Compute Grid, Twitter’s Storm), etc. - it is not that Hadoop engineers just didn’t know about these; rather, they are largely enabled by in-memory technology.
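To make the incremental indexing point a bit more concrete, here is a minimal Python sketch. It is a toy, not how Percolator or GridGain actually work: the `IncrementalIndex` class and the `batch_rebuild` function are hypothetical names invented for this illustration. The idea it tries to show is that when the index lives in memory, a changed document costs only the work of its own changed terms, while the batch approach rescans the whole corpus to get the same answer.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy in-memory inverted index, updated one document at a time.

    Hypothetical illustration only; not the API of any real product.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.docs = {}                    # doc id -> terms currently indexed

    def upsert(self, doc_id, text):
        """Index a new or changed document, touching only the terms that differ."""
        new_terms = set(text.lower().split())
        old_terms = self.docs.get(doc_id, set())
        for term in old_terms - new_terms:
            # Term disappeared from the document: drop the posting.
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        for term in new_terms - old_terms:
            # Term is new for this document: add the posting.
            self.postings[term].add(doc_id)
        self.docs[doc_id] = new_terms

    def search(self, term):
        """Return the ids of documents that currently contain the term."""
        return set(self.postings.get(term.lower(), set()))


def batch_rebuild(corpus):
    """Batch-style alternative: rescan every document to rebuild the index."""
    postings = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return dict(postings)
```

With this sketch, calling `upsert` again for a document that changed makes the new content searchable immediately, whereas the batch version only reflects the change after the next full rebuild.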

Naturally, in-memory technologies don’t invalidate the need for Hadoop HDFS, the proverbial data warehouse. In many cases (but not all) HDFS can happily coexist with something like GridGain, which provides native upstream and downstream integration with HDFS, enabling you to do streaming MapReduce/CEP processing on data in HDFS - among many other things. To sum up my thoughts: I believe and hope that Hadoop HDFS is here to stay, and that we’ll see more and more companies moving away from disk-based processing towards all kinds of in-memory technologies.
Published at DZone with permission of Nikita Ivanov, author and DZone MVB. (source)



Mark Unknown replied on Thu, 2012/07/12 - 1:57pm

So how would GridGain solve this problem? 

Also, the thing to consider besides the engine itself is the ecosystem. I looked through the pointed-to article and did some follow-on googling to look at the possible real-time replacements for Hadoop. I wonder if they have something like Mahout and the NLP tools designed to work with Hadoop (for example).

Nikita Ivanov replied on Thu, 2012/07/12 - 8:16pm in response to: Mark Unknown

The Hadoop ecosystem is probably its strongest point. Various projects (Mahout is one of them) are excellent examples of that. But no ecosystem in the world can make Hadoop process anything in real time, or process streaming data - it simply wasn't designed for that.

Other projects and products may not have the same ecosystem around them as Hadoop - but they solve problems that are unsolvable by Hadoop-like systems (i.e. batch, disk-based).



Mark Unknown replied on Fri, 2012/07/13 - 8:57am in response to: Nikita Ivanov

Right - I understand that Hadoop isn't for real time. I was just wondering if you had any insight into the ecosystem of real-time products - i.e. your product. The problem is that without the ecosystem, real-time products are basically unusable to the masses.

Anyway, in the meantime I will be digging into all the OSS real-time products I found links to as a result of the article you referred to.
