Big Data/Analytics Zone is brought to you in partnership with:

Machine learner and data scientist, Ph.D. from the University of Bonn in 2005, now working as a PostDoc at TU Berlin and as chief data scientist and co-founder at TWIMPACT, a startup focussing on real-time social media analysis. Mikio is a DZone MVB and is not an employee of DZone and has posted 34 posts at DZone. You can read more from them at their website. View Full User Profile

Apache Spark: The Next Big Data Thing?

01.20.2014
| 19515 views |
  • submit to reddit

Apache Spark is generating quite some buzz right now. Databricks, the company founded to support Spark raised $14M from Andreessen Horowitz, Cloudera has decided to fully support Spark, and others chime in that it’s the next big thing. So I thought it’s high time I took a look to get an understanding what the whole buzz is around.

I played around with the Scala API (Spark is written in Scala), and to be honest, at first I was pretty underwhelmed, because Spark looked, well, so small. The basic abstraction are Resilient Distributed Datasets (RDDs), basically distributed immutable collections, which can be defined based on local files or files stored in on Hadoop via HDFS, and which provide the usual Scala-style collection operations like map, foreach and so on.

My first reaction was “wait, is this basically distributed collections?” Hadoop in comparison seemed to be so much more, a distributed filesystem, obviously map reduce, with support for all kinds of data formats, data sources, unit testing, clustering variants, and so on and so on.

Others quickly pointed out that there’s more to it, in fact, Spark also provides more complex operations like joins, group-by, or reduce-by operations so that you can model quite complex data flows (without iterations, though).

Over time it dawned on me that the perceived simplicity of Spark actually said a lot more about the Java API of Hadoop than Spark. Even simple examples in Hadoop usually come with a lot of boilerplate code. But conceptually speaking, Hadoop is quite simple as it only provides two basic operations, a parallel map, and a reduce operation. If expressed in the same way on something resembling distributed collections, one would in fact have an even smaller interface (some projects like Scalding actually build such things and the code looks pretty similar to that of Spark).

So after convincing me that Spark actually provides a non-trivial set of operations (really hard to tell just from the ubiqitous word count example), I digged deeper and read this paper which describes the general architecture. RDDs are the basic building block of Spark and are actually really something like distributed immutable collections. These define operations like map or foreach which are easily parallelized, but also join operations which take two RDDs and collects entries based on a common key, as well as reduce-by operations which aggregates entries using a user specified function based on a given key. In the word count example, you’d map a text to all the words with a count of one, and then reduce them by key using the word and summing up the counts to get the word counts. RDDs can be read from disk but are then held in memory for improved speed where they can also be cached so you don’t have to reread them every time. That alone adds a lot of speed compared to Hadoop which is mostly disk based.

Now what’s interesting is Spark’s approach to fault tolerance. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations which led to a certain data set. So when a node fails, Spark reconstructs the data set based on the stored information. They argue that this is actually not that bad because the other nodes will help in the reconstruction.

So in essence, compared to bare Hadoop, Spark has a smaller interface (which might still become similarly bloated in the future), but there are many projects on top of Hadoop (like Twitter’s Scalding, for example), which achieve a similar level of expressiveness. The other main difference is that Spark is in-memory by default, which naturally leads to a large improvement in performance, and even allows to run iterative algorithms. Spark has no built-in support for iterations, though, it’s just that they claim it’s so fast that you can run iterations if you want to.

Spark Streaming - return of the micro-batch

Spark also comes with a streaming data processing model, which got me quite interested, of course. There is again a paper which summarizes the design quite nicely. Spark follows an interesting and different approach compared to frameworks like Twitter’s Storm. Storm is basically like a pipeline where you push individual events in which then get processed in a distributed fashion. Instead, Spark follows a model where events are collected and then processed at short time intervals (let’s say every 5 seconds) in a batch manner. The collected data become an RDD of their own which is then processed using the usual set of Spark applications.

The authors claim that this mode is more robust against slow nodes and failures, and also that the 5 second interval are usually fast enough for most applications. I’m not so sure about this, as distributed computing is always pretty complex and I don’t think you can easily say that something’s are generally better than others. This approach also nicely unifies the streaming with the non- streaming parts, which is certainly true.

Final thoughts

What I saw looked pretty promising, and given the support and attention Spark receives, I’m pretty sure it will mature and become a strong player in the field. It’s not well-suited for everything. As the authors themselves admit, it’s not really well suited to operations which require to change only a few entries the data set at the time due to the immutable nature of the RDDs. In principle, you have to make a copy of the whole data set even if you just want to change one entry. This can be nicely paralellized, but is of course costly. More efficient implementations based on copy-on-write schemes might also work here, but are not implement yet if I’m not mistaken.

Stratosphere is research project at the TU Berlin which has similar goals, but takes the approach even further by including more complex operations like iterations and not only storing the sequence of operations for fault tolerance, but to use them for global optimization of the scheduling and paralellization.

Immutability is pretty on vogue here as it’s easier to reason about, but I’d like to point you to this excellent article by Baron Schwartz on how you’ll always end up with mixed strategies (mutable and immutable data) to make it work in the real-world.

Published at DZone with permission of Mikio Braun, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)