How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1
After spending some time on a single-node pseudo-distributed cluster, it's time to move on to real-world Hadoop. There are multiple ways to achieve this, depending on what works best for you; in this article I cover how to set up a multi-node Hadoop cluster on Amazon EC2. We are going to set up a 4-node Hadoop cluster as below.
- NameNode (Master)
- DataNode (Slave1)
- DataNode (Slave2)
Here’s what you will need:
- Amazon AWS account
- PuTTY Windows client (to connect to the Amazon EC2 instances)
- PuTTYgen (to generate the private key that PuTTY will use to connect to the EC2 instances)
- WinSCP (secure copy)
This will be a two-part series.
In Part 1 I will cover the infrastructure side, as below:
- Setting up Amazon EC2 instances
- Setting up client access to the Amazon instances (using PuTTY)
- Setting up WinSCP access to the EC2 instances
In Part 2 I will cover the Hadoop multi-node cluster installation:
- Hadoop multi-node installation and setup
1. Setting up Amazon EC2 Instances
With a 4-node cluster and the minimum volume size of 8GB, expect a charge of about $2 per day on average while all 4 instances are running. You can stop an instance at any time to avoid the charge, but you will lose its public IP and hostname, and restarting the instance will assign new ones. You can also terminate an Amazon EC2 instance at any time; by default, termination deletes the instance, so be careful what you are doing.
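As a rough illustration of the arithmetic behind the $2 per day figure, the sketch below derives the per-instance hourly rate that figure implies and uses it to estimate the cost of a shorter session. The function names are hypothetical, and the rate is only back-calculated from the estimate above, not taken from an AWS price list.

```python
HOURS_PER_DAY = 24


def implied_hourly_rate(daily_total_usd: float, instance_count: int) -> float:
    """Per-instance hourly rate implied by a daily total for always-on instances."""
    return daily_total_usd / (instance_count * HOURS_PER_DAY)


def estimated_cost(hourly_rate_usd: float, instance_count: int, hours_running: float) -> float:
    """Estimated charge for running instance_count instances for hours_running hours."""
    return hourly_rate_usd * instance_count * hours_running


if __name__ == "__main__":
    # Assumption: about $2 per day for 4 running instances, as quoted in the text.
    rate = implied_hourly_rate(2.0, 4)
    print(f"Implied rate: ${rate:.4f} per instance-hour")
    print(f"4 instances for 3 hours: ${estimated_cost(rate, 4, 3):.2f}")
```

Stopping the instances when you are not working keeps `hours_running` small, which is the main lever on the bill for a test cluster like this one.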
1.1 Get Amazon AWS Account
If you do not already have an account, please create a new one. I already have an AWS account, so I am going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.
1.2 Launch Instance
Once you have signed up for an Amazon account, log in to Amazon Web Services, click on My Account, and navigate to the Amazon EC2 Console.
1.3 Select AMI
I am picking Ubuntu Server 12.04.3 64-bit.
1.4 Select Instance Type
Select the micro instance.
1.5 Configure Number of Instances
As mentioned, we are setting up a 4-node Hadoop cluster, so please enter 4 as the number of instances. Please check the Amazon EC2 free-tier requirements; you may set up a 3-node cluster with < 30GB total storage to avoid any charges. In a production environment you want to have the SecondaryNameNode on a separate machine.
1.6 Add Storage
The minimum volume size is 8GB.
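To make the free-tier storage check from step 1.5 concrete, here is a minimal sketch that compares total volume size against the 30GB figure quoted above. The constant and function names are hypothetical, and the limit is the one stated in this article; check the current AWS free-tier terms before relying on it.

```python
FREE_TIER_STORAGE_GB = 30  # limit as quoted in the article, not verified against AWS
MIN_VOLUME_GB = 8          # minimum volume size per instance, from step 1.6


def total_storage_gb(node_count: int, volume_gb: int = MIN_VOLUME_GB) -> int:
    """Total storage across all nodes, one volume per node."""
    return node_count * volume_gb


def fits_free_tier(node_count: int, volume_gb: int = MIN_VOLUME_GB) -> bool:
    """True when the total stays under the quoted free-tier storage limit."""
    return total_storage_gb(node_count, volume_gb) < FREE_TIER_STORAGE_GB


if __name__ == "__main__":
    for nodes in (3, 4):
        print(nodes, "nodes:", total_storage_gb(nodes), "GB, fits free tier:", fits_free_tier(nodes))
```

With 8GB volumes, three nodes total 24GB and stay under the limit, while four nodes total 32GB and exceed it, which is why the text suggests a 3-node cluster for a no-charge setup.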
1.7 Instance Description
Give your instance a name and description.
1.8 Define a Security Group
Create a new security group. Later on we are going to modify it by adding security rules.
1.9 Launch Instance and Create Key Pair
Review and Launch Instance.
Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair.
Create a new key pair, name it “hadoopec2cluster”, and download the key pair (.pem) file to your local machine. Then click Launch Instance.
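For readers who prefer scripting to the console, the choices made in the steps above map roughly onto the parameters of an EC2 launch request. The sketch below only builds that parameter dictionary and makes no AWS call; the keys follow the shape accepted by boto3's `run_instances`. Several values are assumptions rather than details from this article: the AMI ID is a placeholder, `t1.micro` is assumed to be the micro instance type, `/dev/sda1` is assumed to be the root device name, and the helper name `build_launch_request` is hypothetical. The security group name is the one used later in this article.

```python
def build_launch_request(ami_id: str, node_count: int = 4, volume_gb: int = 8) -> dict:
    """Collect the console choices into one launch-request dictionary."""
    return {
        "ImageId": ami_id,                    # placeholder, look up the real Ubuntu AMI ID
        "InstanceType": "t1.micro",           # assumption: the "micro" instance type
        "MinCount": node_count,               # launch exactly node_count instances
        "MaxCount": node_count,
        "KeyName": "hadoopec2cluster",        # key pair name from step 1.9
        "SecurityGroups": ["HadoopEC2SecurityGroup"],
        "BlockDeviceMappings": [
            # assumption: root device is /dev/sda1 on this AMI
            {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": volume_gb}},
        ],
    }


if __name__ == "__main__":
    request = build_launch_request("ami-xxxxxxxx")  # hypothetical AMI ID
    for key, value in request.items():
        print(f"{key}: {value}")
```

If boto3 is installed and credentials are configured, a dictionary like this could be passed as keyword arguments to an EC2 client's `run_instances`, but the console workflow described in this article achieves the same result.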
1.10 Launching Instances
Once you click “Launch Instance”, 4 instances should be launched in the “pending” state.
Once they are in the “running” state, we are going to rename the instances as below.
- HadoopNameNode (Master)
- HadoopSlave1 (data node will reside here)
- HadoopSlave2 (data node will reside here)
Please note down the Instance ID, Public DNS/URL (for example, ec2-54-209-221-112.compute-1.amazonaws.com), and Public IP of each instance for your reference. We will need them later on to connect from the PuTTY client. Also notice that we are using “HadoopEC2SecurityGroup”.
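The example Public DNS name above appears to embed the instance's public IP in its first label. As a small sketch under that assumption, the hypothetical helper below extracts the IP from a name of that form; treat it as a convenience check and still record the Public IP shown in the console.

```python
import re

# Matches the leading "ec2-A-B-C-D." label of an EC2 public DNS name.
_EC2_DNS_PATTERN = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def public_ip_from_dns(dns_name: str) -> str:
    """Return the dotted IP embedded in an ec2-A-B-C-D.* hostname."""
    match = _EC2_DNS_PATTERN.match(dns_name)
    if match is None:
        raise ValueError(f"not an ec2-A-B-C-D style hostname: {dns_name!r}")
    return ".".join(match.groups())


if __name__ == "__main__":
    print(public_ip_from_dns("ec2-54-209-221-112.compute-1.amazonaws.com"))
```

Because a stopped and restarted instance gets a new public IP and hostname, any saved PuTTY or WinSCP session pointing at the old name will need updating.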
You can use the existing group or create a new one. When you create a group with the default options, it adds a rule for SSH on port 22. To allow TCP and ICMP access as well, we need to add 2 additional security rules, so that the inbound rules of “HadoopEC2SecurityGroup” contain ‘All TCP’, ‘All ICMP’, and ‘SSH (22)’. This will allow ping, SSH, and other similar commands among the servers and from any other machine on the internet. Make sure to click “Apply Rule changes” to save your changes.
These protocols and ports are also required for communication among the cluster servers. Because this is a test setup, we allow all TCP, ICMP, and SSH traffic rather than working out the individual ports and security needs of each server.
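The three inbound rules described above can be written down as data. The sketch below builds them in the dictionary shape that boto3's `authorize_security_group_ingress` accepts for its `IpPermissions` argument, without calling AWS. The function name and the `0.0.0.0/0` source range (meaning any machine on the internet, as the text describes) are assumptions for illustration.

```python
ANYWHERE = "0.0.0.0/0"  # assumption: open to any source address, as in this test setup


def inbound_rules(cidr: str = ANYWHERE) -> list:
    """Inbound rules for the group: All TCP, All ICMP, and SSH (22)."""
    return [
        # All TCP: every TCP port
        {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535, "IpRanges": [{"CidrIp": cidr}]},
        # All ICMP: -1 means all ICMP types and codes (enables ping)
        {"IpProtocol": "icmp", "FromPort": -1, "ToPort": -1, "IpRanges": [{"CidrIp": cidr}]},
        # SSH: the default rule, already covered by All TCP but listed as in the console
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": cidr}]},
    ]


if __name__ == "__main__":
    for rule in inbound_rules():
        print(rule)
```

Passing a narrower `cidr`, such as your own address range, would restrict who can reach the cluster while keeping the same three rules.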
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)