
Hadoop 101: An Explanation of the Hadoop Ecosystem

08.20.2014

Big data is taking off in 2014. More companies than ever are finding uses for it, both for managing everyday business routines and for finding solutions to complex business problems. It’s quickly moving away from its position as a hype word and establishing itself as a viable technology for businesses and entities both big and small.

Big data, simply put, is the huge amount of data all around us, generated by smart devices, internet usage, social media, chat rooms, mobile apps, phone calls, purchasing history, and countless other sources. Big data technology gathers, stores and analyzes this information, which is often on the petabyte scale.

The technology is completely changing how people look at data and databases and how that data is used. The military is using big data to prevent injuries. The NBA is using it to capture and analyze millions of individual movements during a game. The healthcare industry is using big data to fight cancer and heart disease. Car companies are using the technology to develop self-driving cars that communicate with one another.

Big data is changing the world. What, though, is the software behind all of this? What keeps the big data technology up and running?

Hadoop.

Many people assume that Hadoop is big data. It’s not. There was big data before Hadoop, and there continues to be big data without Hadoop. However, Hadoop is now a huge player in big data. There’s a reason it’s become synonymous with big data: so many people use it. You’ll have your work cut out for you finding companies working with big data that aren’t using some sort of Hadoop software. What exactly is Hadoop?

It’s a “software library” that gives users the ability to process “large data sets across clusters of computers using simple programming models.” In other words, it gives companies the capability to gather, store and analyze huge sets of data.

Additionally, an important thing to understand about Hadoop is that it’s a “software library.” There’s a large ecosystem of programs that complement the base Hadoop framework and give companies the specific tools they need to get the desired Hadoop results.

Let’s take a look at the Hadoop ecosystem. This information and more can be found at Hadoop’s website.

There are four modules contained within the core Hadoop project: Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce. Together these give users the tools to support the additional Hadoop projects mentioned below, along with the ability to process large data sets while automatically scheduling jobs and managing cluster resources.
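The MapReduce programming model at the heart of those modules is simple enough to sketch in plain Python. This is a toy illustration of the idea, not the Hadoop API: a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in this input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key -- here, a word count.
    return (key, sum(values))

documents = ["big data is big", "data at petabyte scale"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The appeal of the model is that each phase is independent, so the framework can run maps and reduces in parallel across a cluster while the programmer only writes the two small functions.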

To complement the Hadoop modules there are also a variety of other projects that provide specialized services.

Apache Hive: “A data warehouse infrastructure that provides data summarization and ad hoc querying.” Hive lets users write powerful SQL-like queries (in HiveQL) over data stored in Hadoop and get summarized results back.
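Since Hive only runs on a Hadoop cluster, here is a rough illustration of the kind of summarization query it supports, run against Python’s built-in sqlite3 as a stand-in for a Hive table. The table and column names are invented for the example; this is the style of query, not Hive’s own API.

```python
import sqlite3

# A stand-in table; in Hive this would typically be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("ann", "/home", 12), ("bob", "/home", 30), ("ann", "/cart", 45)],
)

# Ad hoc summarization: total time spent per user, largest first.
rows = conn.execute(
    "SELECT user, SUM(duration) FROM page_views GROUP BY user ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('ann', 57), ('bob', 30)]
```

The point of Hive is exactly this familiarity: analysts who know SQL can summarize petabyte-scale data without writing MapReduce jobs by hand.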

Apache Spark: Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark can run on top of HDFS but bypasses MapReduce, using its own data processing framework instead. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
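One reason Spark suits iterative work is that its transformations are declared lazily and only execute when a result is requested. The toy Python class below sketches that dataset-style chaining; it is an illustration of the idea only, not Spark’s actual RDD/DataFrame API.

```python
class ToyDataset:
    """A minimal, Spark-flavored dataset: transformations are recorded
    lazily and only executed when an action like collect() is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, fn):
        # Transformation: record the operation, do no work yet.
        return ToyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return ToyDataset(self._data, self._ops + [("filter", fn)])

    def collect(self):
        # Action: apply the recorded pipeline in one pass over the data.
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

readings = ToyDataset([3, 8, 15, 4, 23])
hot = readings.filter(lambda t: t > 5).map(lambda t: t * 2)
print(hot.collect())  # [16, 30, 46]
```

Because nothing runs until `collect()`, an engine like Spark can see the whole pipeline at once, fuse the steps, and keep intermediate data in memory rather than writing it to disk between phases as MapReduce does.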

Apache Ambari: Ambari was created to help manage Hadoop. It offers support for many of the tools in the Hadoop ecosystem including Hive, HBase, Pig, Sqoop and ZooKeeper. The tool features a management dashboard that keeps track of cluster health and can help diagnose performance issues.

Apache Pig: Pig is a platform with a high-level query language (Pig Latin) built to handle large data sets.

Apache HBase: HBase is a non-relational database management system that runs on top of HDFS. It is built to handle sparse data sets common to big data projects.
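What makes HBase a good fit for sparse data is its storage shape: each row is a map from column name to value, so a column that a row doesn’t use costs nothing. The Python sketch below illustrates that shape with plain dictionaries; the `put`/`get` helpers and column names are invented for the example, not HBase’s client API.

```python
# Each row is a sparse map from "family:qualifier" to a value; rows need
# not share columns, which is what makes the layout cheap for sparse data.
table = {}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    # A missing column simply comes back as None; it occupies no storage.
    return table.get(row_key, {}).get(column)

put("user1", "info:name", "Ann")
put("user1", "stats:logins", 42)
put("user2", "info:name", "Bob")  # no stats columns stored at all

print(get("user1", "stats:logins"))  # 42
print(get("user2", "stats:logins"))  # None
```

In a relational table, every row would reserve space (or at least a NULL) for every column; in this layout, millions of rarely-used columns cost only the cells actually written.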

Other common Hadoop projects include: Avro, Cassandra, Chukwa, Mahout, and ZooKeeper.

By implementing Hadoop, users gain access to an amazing amount of tools and resources that allow them to truly personalize their big data experience to fit whatever their business needs may be.

Published at DZone with permission of its author, Gil Allouche. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)