
Lijin Joseji is a Senior IT Specialist who has worked with IBM Global Business Services since 2008. He has been involved in various projects that make use of WebSphere Extended Deployment components as well as other open source technologies. His areas of expertise include the design and development of J2EE applications, WebSphere Extended Deployment components such as IBM Object Grid and Compute Grid, SOA architecture, open source frameworks such as Spring and Hibernate, web service frameworks, NoSQL and SQL databases, and mobile development. He currently specializes in WebSphere eXtreme Scale, WebSphere Extended Deployment Compute Grid and Object Grid related projects, NoSQL databases, Android development, and cloud computing. He writes about his technical views and experience on his blog, OrangeSlate.com. Lijin is a DZone MVB and is not an employee of DZone and has posted 6 posts at DZone. You can read more from them at their website.

6 sparkling features of Apache Spark!

08.05.2014

What is Apache Spark? Why is there such a serious buzz about it? If you are in the Big Data analytics business, should you really care about Spark? I hope this post helps to answer some of the questions that may have been on your mind lately.

Apache Spark is a powerful open source processing engine for Hadoop data, built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley’s AMPLab and later moved to Apache. Spark is basically a parallel data processing framework that can work with Apache Hadoop to make it easy to develop fast Big Data applications that combine batch, streaming, and interactive analytics on all your data.

Let's go through some of the features that make it stand out in the Big Data world!

  1. Lightning Fast Processing

When it comes to Big Data processing, speed always matters. We always want to process our huge data sets as fast as possible. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it keeps intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when needed. This eliminates most of the disk reads and writes, which are the main time-consuming factors in data processing.

(Spark Performance over Hadoop. Image Courtesy: Cloudera. Visit this link to see how Jai & Matei explain the delightful experience Spark gives its developers.)
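To make the in-memory idea a bit more concrete, here is a minimal sketch in plain Python. It is not the Spark API: `ToyDataset` and all of its methods are hypothetical names invented for this illustration, and everything runs on a single machine. It only shows the general pattern described above, where transformations are recorded lazily and an intermediate result that has been marked for caching is computed once and then reused from memory.

```python
class ToyDataset:
    """Plain-Python stand-in for the RDD idea. Hypothetical; NOT the Spark API."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the data on demand
        self._cache_enabled = False    # set to True by cache()
        self._cached = None            # in-memory copy, filled on first collect()
        self.compute_count = 0         # how many times the data was rebuilt

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Lazy: nothing runs until collect() is called on the result.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Ask for the result to be kept in memory after its first computation.
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached        # served from memory, no recomputation
        self.compute_count += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


numbers = ToyDataset.from_list(range(1, 11))
squares = numbers.map(lambda x: x * x).cache()   # shared intermediate result

evens_result = squares.filter(lambda x: x % 2 == 0).collect()
big_result = squares.filter(lambda x: x > 50).collect()

print(evens_result)
print(big_result)
print(squares.compute_count)
```

Both downstream queries reuse the cached `squares` data, so the squaring step runs only once. Without the `cache()` call it would run once per query, which is the kind of repeated work Spark avoids by keeping intermediate data in memory.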

  2. Ease of Use as it supports multiple languages

Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run applications in the programming languages they already know. It comes with a built-in set of over 80 high-level operators. We can also use it interactively to query data from within the shell.

  3. Support for Sophisticated Analytics

In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.

  4. Real time stream processing

Spark can handle real-time streaming. MapReduce mainly handles and processes data that is already stored. Spark, however, can also manipulate data in real time using Spark Streaming. That said, there are other frameworks that can be integrated with Hadoop to handle streaming.

Here is what Cloudera says about Spark's streaming abilities:

  • Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming lets you rapidly develop streaming applications
  • Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark Streaming recovers lost work and delivers exactly-once semantics out of the box with no extra code or configuration
  • Integrated: Reuse the same code for batch and stream processing, even joining streaming data to historical data

(Streaming Performance over Storm. Image Courtesy: Cloudera.com)
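The "Integrated" point above is easier to picture with a small example. The following is a rough sketch in plain Python, not Spark Streaming code: `count_words` and `micro_batches` are hypothetical helpers written for this illustration, and the sample lines are made up. Spark Streaming processes a stream as a series of small batches; here the batches are cut by record count purely to keep the sketch simple. The point is that the same counting logic serves both the stored data and the incoming stream.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """Shared logic: count words in any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, batch_size):
    """Group an incoming iterator of records into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Made-up sample data: records already stored vs. records arriving later.
historical = ["spark is fast", "hadoop stores data"]
incoming = ["spark streams data", "spark is easy", "data is big"]

# Batch processing over the stored data.
batch_counts = count_words(historical)

# Stream processing: the same function is applied to each micro-batch,
# and the results are merged into the historical counts.
running = Counter(batch_counts)
for batch in micro_batches(incoming, batch_size=2):
    running.update(count_words(batch))

print(running.most_common(3))
```

Because `count_words` does not care whether its input is a stored file or a small slice of a live stream, nothing has to be rewritten when moving from batch to streaming, and the streaming results can be combined with historical ones directly.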

  5. Ability to integrate with Hadoop and existing Hadoop Data

Spark can run independently. It can also run on Hadoop 2’s YARN cluster manager and read any existing Hadoop data. That’s a BIG advantage! It can read from any Hadoop data source, for example HBase or HDFS. This makes Spark suitable for migrating existing pure Hadoop applications, provided the application's use case really suits Spark. Because Spark relies more heavily on immutability, not every scenario is suitable for migration.

  6. Active and expanding Community

Apache Spark is built by a wide set of developers from over 50 companies. The project started in 2009 and as of now more than 250 developers have contributed to Spark already! It has active mailing lists and JIRA for issue tracking.

Below are some useful links to start with:

If you want to learn the basics of Apache Spark, my previous post will help you. It has a link to a training video that explains Spark in a simple way.

Published at DZone with permission of Lijin Joseji, author and DZone MVB. (source)
