
How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 2

01.28.2014

In Part 1 we successfully created, launched, and connected to our Amazon Ubuntu instances. In Part 2 I will show how to install and set up the Hadoop cluster. If you are seeing this page for the first time, I strongly advise you to go over Part 1 first.

In this article:

  • HadoopNameNode will be referred to as the master,
  • HadoopSecondaryNameNode will be referred to as the SecondaryNameNode or SNN,
  • HadoopSlave1 and HadoopSlave2 will be referred to as the slaves (where the data nodes will reside).

So, let’s begin.

1. Apache Hadoop Installation and Cluster Setup

1.1 Update the Packages and Dependencies

Let’s update the packages. I will start with the master; repeat this for the SNN and the 2 slaves.

$ sudo apt-get update
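Note that apt-get update only refreshes the package index; it does not upgrade anything. If you also want to bring the already-installed packages up to date on these fresh instances, you can optionally follow it with:

$ sudo apt-get -y upgrade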

Once it’s complete, let’s install Java.

1.2 Install Java

Add the following PPA and install the latest Oracle Java (JDK) 7 on Ubuntu:

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update && sudo apt-get install oracle-java7-installer
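The installer will interactively prompt you to accept Oracle’s license on each node. If you would rather script the install across all four machines, the prompt can reportedly be pre-accepted through debconf; the key below is the one commonly documented for the WebUpd8 package, so treat it as an assumption and verify it on your own instances:

$ echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections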

Check that Ubuntu now uses JDK 7.

[Screenshot: Java installation check]
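You can verify from the shell; the exact build number will vary, but it should report a 1.7.0 version along the lines of:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)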

Repeat this for the SNN and the 2 slaves.

1.3 Download Hadoop

I am going to use the Hadoop 1.2.1 stable release from the Apache download page, and here is the 1.2.1 mirror.

Issue the wget command from the shell:

$ wget http://apache.mirror.gtcomm.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

[Screenshot: Hadoop download]

Unpack the archive and review the package contents and configuration files.

$ tar -xzvf hadoop-1.2.1.tar.gz

[Screenshot: directory listing]
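In particular, the conf/ directory holds the files we will edit when configuring the cluster. A quick look (assuming the default 1.2.1 layout):

$ ls hadoop-1.2.1/conf
# among others: core-site.xml, hdfs-site.xml, mapred-site.xml,
# hadoop-env.sh, masters, slaves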

For ease of operation and maintenance, rename the ‘hadoop-1.2.1’ directory to simply ‘hadoop’.

$ mv hadoop-1.2.1 hadoop

[Screenshot: directory listing after rename]

1.4 Set Up Environment Variables

Set up the environment variables for the ‘ubuntu’ user.

Update the .bashrc file to add important Hadoop paths and directories.

Navigate to the home directory:

$ cd

Open the .bashrc file in the vi editor:

$ vi .bashrc

Add the following at the end of the file:

export HADOOP_CONF=/home/ubuntu/hadoop/conf
export HADOOP_PREFIX=/home/ubuntu/hadoop

# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

Save and Exit.

To check whether it has been updated correctly, reload the bash profile and use the following commands:

$ source ~/.bashrc
$ echo $HADOOP_PREFIX
$ echo $HADOOP_CONF
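Since bin/ is now on the PATH, the hadoop command itself makes another quick sanity check; its banner should report version 1.2.1:

$ hadoop version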
Repeat 1.3 and 1.4 for the remaining 3 machines (SNN and the 2 slaves).

1.5 Set Up Password-less SSH on the Servers

The master server remotely starts services on the slave nodes, which requires password-less SSH access to the slave servers. The AWS Ubuntu server comes with the OpenSSH server pre-installed.

Quick Note: The public part of the key loaded into the agent must be put on the target system in ~/.ssh/authorized_keys. This has been taken care of by the AWS server creation process.
Now we need to add the AWS EC2 key pair identity hadoopec2cluster.pem to the SSH profile. To do that, we will need the following SSH utilities:

  • ‘ssh-agent’ is a background program that handles passwords for SSH private keys.
  • ‘ssh-add’ prompts the user for a private key password and adds it to the list maintained by ssh-agent. Once you add a password to ssh-agent, you will not be asked to provide the key when using SSH or SCP to connect to hosts with your public key.

The Amazon EC2 instance creation process has already taken care of ‘authorized_keys’ on the master server; execute the following commands to allow password-less SSH access to the slave servers.

First of all, we need to protect our key pair file. If the file permissions are too open (see below), you will get an error.

[Screenshot: SSH error due to open key file permissions]

To fix this problem, we need to issue the following commands:

$ chmod 644 authorized_keys

Quick Tip: If you set the permissions to ‘chmod 644’, you get a file that can be written by you, but can only be read by the rest of the world.

$ chmod 400 hadoopec2cluster.pem

Quick Tip: chmod 400 is a very restrictive setting, giving only the file owner read-only access: no write/execute capabilities for the owner, and no permissions whatsoever for anyone else.
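You can confirm the result with ls -l; the owner, size, and date below are illustrative, but the permission column should read exactly like this:

$ ls -l hadoopec2cluster.pem
-r-------- 1 ubuntu ubuntu 1696 Jan 28 12:00 hadoopec2cluster.pem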

To use ssh-agent and ssh-add, follow the steps below:

  1. At the Unix prompt, enter: eval `ssh-agent`
     Note: Make sure you use the backquote ( ` ), located under the tilde ( ~ ), rather than the single quote ( ' ).
  2. Enter the command: ssh-add hadoopec2cluster.pem

If you notice, the .pem file has “read-only” permission now, and this time it works for us.

[Screenshot: successful ssh-add]
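To confirm the identity was actually loaded into the agent, you can list it:

$ ssh-add -l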
Keep in mind that the ssh-agent session will be lost upon shell exit, and you will have to repeat the ssh-agent and ssh-add commands.
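If retyping those two commands gets tedious, one option is to append them to ~/.bashrc so that every new login shell starts an agent and loads the key automatically. This is just a sketch, assuming the .pem file sits in the ubuntu user's home directory:

# start an agent for this shell (sketch; spawns one agent per login shell)
eval `ssh-agent` > /dev/null
# load the cluster key; adjust the path if you keep the .pem elsewhere
ssh-add /home/ubuntu/hadoopec2cluster.pem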

Remote SSH

Let’s verify that we can connect to the SNN and the slave nodes from the master:

[Screenshot: remote SSH connection]

$ ssh ubuntu@<your-amazon-ec2-public URL>
On a successful login, the IP address shown in the shell prompt will change.
Published at DZone with permission of Hardik Pandya, author and DZone MVB.
