As a passionate software developer, motivated by learning and applying innovative and interesting software development tools, techniques and methodologies, my professional objectives are the following. To be in a technology-oriented enterprise where the technical staff is the soul of the company. To be in an important IT team. To be able to design and develop state-of-the-art software. To be able to apply new knowledge every day, in innovative ways and with a great degree of freedom. To architect, design and develop software that uses the best practices of the field. To play with the latest technologies, learn every day and participate in the research and innovation of the software products. Carlo is a DZone MVB and is not an employee of DZone and has posted 15 posts at DZone. You can read more from them at their website.

Setting up a Hadoop Virtual Cluster with Vagrant

09.09.2012
Usually, when I want to test or use a virtual machine, I go online, download the ISO image of the system I want to install, start VirtualBox, tell it to boot from the ISO, install the OS manually, and then install the applications I want to use. It is a boring and tedious process that I never really cared for. However, I recently discovered the power of Vagrant and Puppet, which let me automate all the steps I previously did by hand.

Here I test-drive the process of automatically configuring a Hadoop cluster on virtual machines in fully distributed mode.

First of all, make sure you have Ruby installed. I’m testing with Ruby 1.9.3. You should also have VirtualBox installed. I have version 4.1.

Then from the command line install the vagrant gem:

gem install vagrant

Vagrant is a great tool that allows us to manage our VirtualBox machines using the command line and simple configuration files.

First we install an Ubuntu Linux virtual machine (or a box, as it is called in Vagrant):

vagrant box add base-hadoop http://files.vagrantup.com/lucid64.box

Then we go to the directory we want to use as our “workspace”, which is also where the Vagrant configuration file for our new box will live, and execute the following command. It creates a Vagrantfile containing the Vagrant configuration.

vagrant init base-hadoop

The virtual machine is ready to be started up now. You can start it by doing:

vagrant up

The virtual machine is now running. You can connect to it over SSH by typing:

vagrant ssh

The next step is to download Puppet from http://puppetlabs.com/misc/download-options/

Puppet is a tool that allows us to automate the provisioning of servers. We will use it to manage our virtual machines, installing the required software on them and running the required services.

So we create a directory to hold our manifests (Puppet configuration files):

mkdir manifests

In that new directory we create a file called base-hadoop.pp with the following content:

group { "puppet":
  ensure => "present",
}
 
In the Vagrantfile that was created earlier, we uncomment the lines that look like:

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "base-hadoop.pp"
  end


The next thing we need to do is tell Puppet to install Java on our servers. For that we open the base-hadoop.pp file and add the following:

exec { 'apt-get update':
    # exec needs a fully qualified command unless a path is given
    command => '/usr/bin/apt-get update',
}

package { "openjdk-6-jdk" :
    ensure  => present,
    require => Exec['apt-get update'],
}


Next we need to install Hadoop. For this we will create a new Puppet module. A Puppet module encapsulates resources that belong to the same component.

We execute

mkdir -p modules/hadoop/manifests

Then we create an init.pp in this new manifests directory with the following content:

class hadoop {
  $hadoop_home = "/opt/hadoop"

  exec { "download_hadoop":
    command => "wget -O /tmp/hadoop.tar.gz http://apache.mirrors.timporter.net/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz",
    path    => $path,
    unless  => "ls /opt | grep hadoop-1.0.3",
    require => Package["openjdk-6-jdk"]
  }

  exec { "unpack_hadoop" :
    command => "tar -zxf /tmp/hadoop.tar.gz -C /opt",
    path    => $path,
    creates => "${hadoop_home}-1.0.3",
    require => Exec["download_hadoop"]
  }
}


We have done a few things here, and they are almost self-explanatory. We set a variable that points to our Hadoop installation, download the Hadoop binaries from an Apache mirror, and unpack them under /opt, which creates the ${hadoop_home}-1.0.3 directory (that is, /opt/hadoop-1.0.3).

We need to add our new module to the main puppet configuration file. We add the following line at the top of the base-hadoop.pp file:

include hadoop

Then we add this new modules path to our Vagrantfile. So now our puppet section looks like:

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "base-hadoop.pp"
    puppet.module_path    = "modules"
  end


We execute the following to reload the vagrant machine:

vagrant reload

That command restarts the Vagrant machine and applies the Puppet manifests, which installs the required software.

We will need a cluster of virtual machines. Vagrant supports that. We open our Vagrantfile and replace the content with the following:

Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"
  config.vm.provision :puppet do |puppet|
     puppet.manifests_path = "manifests"
     puppet.manifest_file  = "base-hadoop.pp"
     puppet.module_path = "modules"
  end
 
  config.vm.define :master do |master_config|
    master_config.vm.network :hostonly, "192.168.1.10"
  end

  config.vm.define :backup do |backup_config|
    backup_config.vm.network :hostonly, "192.168.1.11"
  end
 
  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "192.168.1.12"
  end
 
  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "192.168.1.13"
  end
 
  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "192.168.1.14"
  end
end
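
The five define blocks differ only in name and address, so the same layout could also be written as a table plus a loop. The following is a minimal sketch under that assumption: NODES and define_nodes are hypothetical names introduced here for illustration (they are not part of Vagrant), and the helper uses only the config.vm.define and vm.network :hostonly calls already shown above.

```ruby
# Hypothetical node table: the names and addresses used in this article.
NODES = {
  :master  => "192.168.1.10",
  :backup  => "192.168.1.11",
  :hadoop1 => "192.168.1.12",
  :hadoop2 => "192.168.1.13",
  :hadoop3 => "192.168.1.14"
}

# Hypothetical helper: declares one VM per table entry, each with a
# host-only network address, using the same calls as the listing above.
def define_nodes(config, nodes)
  nodes.each do |name, ip|
    config.vm.define name do |node_config|
      node_config.vm.network :hostonly, ip
    end
  end
end
```

Inside the Vagrantfile this would be called as define_nodes(config, NODES) within the Vagrant::Config.run block, in place of the five explicit definitions.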


After this we execute:

vagrant up

That will start and provision all the servers, which takes a while.

But we are not done yet. Next we need to configure the Hadoop cluster. In the directory modules/hadoop we create another directory called files, where we put the configuration files our Hadoop cluster needs.

We create the following files:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
  <property>
   <name>fs.default.name</name>
   <value>hdfs://master:9000</value>
   <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
 </configuration>


hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>The actual number of replications can be specified when the file is created.</description>
 </property>
</configuration>

 

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
 </property>
</configuration>

 

masters

192.168.1.11

slaves

192.168.1.12
192.168.1.13
192.168.1.14
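
In Hadoop 1.x, conf/masters lists the hosts where the start scripts launch the SecondaryNameNode (which is why it holds the backup machine's address), and conf/slaves lists the worker hosts. Both are conventionally written with one host per line. As a small sketch, the two files could be generated rather than typed by hand; SECONDARY, WORKERS, host_list and write_host_files are hypothetical names used only for this illustration.

```ruby
# Addresses taken from the Vagrantfile in this article.
SECONDARY = ["192.168.1.11"]
WORKERS   = ["192.168.1.12", "192.168.1.13", "192.168.1.14"]

# Hypothetical helper: renders a host list with one entry per line.
def host_list(hosts)
  hosts.join("\n") + "\n"
end

# Hypothetical helper: writes the masters and slaves files into dir.
def write_host_files(dir)
  File.write(File.join(dir, "masters"), host_list(SECONDARY))
  File.write(File.join(dir, "slaves"),  host_list(WORKERS))
end
```

Calling write_host_files("modules/hadoop/files") from the workspace directory would place both files where the Puppet module expects them.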

We then need to tell puppet to copy these files to our cluster. So we modify our init.pp file in the hadoop puppet module to contain the following:

class hadoop {
  $hadoop_home = "/opt/hadoop"

  exec { "download_hadoop":
    command => "wget -O /tmp/hadoop.tar.gz http://apache.mirrors.timporter.net/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz",
    path    => $path,
    unless  => "ls /opt | grep hadoop-1.0.3",
    require => Package["openjdk-6-jdk"]
  }

  exec { "unpack_hadoop" :
    command => "tar -zxf /tmp/hadoop.tar.gz -C /opt",
    path    => $path,
    creates => "${hadoop_home}-1.0.3",
    require => Exec["download_hadoop"]
  }

  file {
    "${hadoop_home}-1.0.3/conf/slaves":
    source  => "puppet:///modules/hadoop/slaves",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file {
    "${hadoop_home}-1.0.3/conf/masters":
    source  => "puppet:///modules/hadoop/masters",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file {
    "${hadoop_home}-1.0.3/conf/core-site.xml":
    source  => "puppet:///modules/hadoop/core-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file {
    "${hadoop_home}-1.0.3/conf/mapred-site.xml":
    source  => "puppet:///modules/hadoop/mapred-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }

  file {
    "${hadoop_home}-1.0.3/conf/hdfs-site.xml":
    source  => "puppet:///modules/hadoop/hdfs-site.xml",
    mode    => 644,
    owner   => root,
    group   => root,
    require => Exec["unpack_hadoop"]
  }
}

 

We then execute:

vagrant provision

And we get these files copied to all our servers.

We need to set up password-less SSH communication between our servers. We modify our base-hadoop.pp, adding the following resources alongside what is already there:

file {
  "/root/.ssh/id_rsa":
  source => "puppet:///modules/hadoop/id_rsa",
  mode => 600,
  owner => root,
  group => root,
  require => Exec['apt-get update']
 }
 
file {
  "/root/.ssh/id_rsa.pub":
  source => "puppet:///modules/hadoop/id_rsa.pub",
  mode => 644,
  owner => root,
  group => root,
  require => Exec['apt-get update']
 }

ssh_authorized_key { "ssh_key":
    ensure => "present",
    key    => "AAAAB3NzaC1yc2EAAAADAQABAAABAQCeHdBPVGuSPVOO+n94j/Y5f8VKGIAzjaDe30hu9BPetA+CGFpszw4nDkhyRtW5J9zhGKuzmcCqITTuM6BGpHax9ZKP7lRRjG8Lh380sCGA/691EjSVmR8krLvGZIQxeyHKpDBLEmcpJBB5yoSyuFpK+4RhmJLf7ImZA7mtxhgdPGhe6crUYRbLukNgv61utB/hbre9tgNX2giEurBsj9CI5yhPPNgq6iP8ZBOyCXgUNf37bAe7AjQUMV5G6JMZ1clEeNPN+Uy5Yrfojrx3wHfG40NuxuMrFIQo5qCYa3q9/SVOxsJILWt+hZ2bbxdGcQOd9AXYFNNowPayY0BdAkSr",
    type   => "ssh-rsa",
    user   => "root",
    require => File['/root/.ssh/id_rsa.pub']
}
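
The two file resources expect id_rsa and id_rsa.pub to exist in modules/hadoop/files (a key pair generated, for example, with ssh-keygen). The key attribute of ssh_authorized_key takes only the base64 body of the public key, without the leading type or the trailing comment. A minimal sketch of pulling those fields out of a public key line, where parse_public_key is a hypothetical helper written for this illustration:

```ruby
# Hypothetical helper: splits one line of an OpenSSH public key file,
# laid out as "<type> <base64 key> [comment]", into its fields.
def parse_public_key(line)
  type, key, comment = line.strip.split(/\s+/, 3)
  raise ArgumentError, "not an OpenSSH public key line" if key.nil?
  { :type => type, :key => key, :comment => comment }
end
```

For example, parse_public_key(File.read("modules/hadoop/files/id_rsa.pub")) would return the values to paste into the type and key attributes above.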
 


We are now ready to run our Hadoop cluster. For that, we once again modify the init.pp file in the hadoop Puppet module, adding the following at the end, just before the closing brace of the hadoop class:

 file {
  "${hadoop_home}-1.0.3/conf/hadoop-env.sh":
  source => "puppet:///modules/hadoop/hadoop-env.sh",
  mode => 644,
  owner => root,
  group => root,
  require => Exec["unpack_hadoop"]
 }

 

The hadoop-env.sh file is the original one, except that we have uncommented the JAVA_HOME setting and pointed it to the correct Java installation.
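
The edit described above can also be scripted. This is a rough sketch only: set_java_home is a hypothetical helper, and the path /usr/lib/jvm/java-6-openjdk in the usage note is an assumption about where the openjdk-6-jdk package installs Java on the guest, so it should be checked on the machine itself.

```ruby
# Hypothetical helper: returns the text of hadoop-env.sh with the
# JAVA_HOME line uncommented (if needed) and set to java_home.
# If no such line exists, an export line is appended instead.
def set_java_home(text, java_home)
  pattern = /^[ \t]*#?[ \t]*export[ \t]+JAVA_HOME=.*$/
  replacement = "export JAVA_HOME=#{java_home}"
  if text =~ pattern
    # Block form avoids backslash interpretation in the replacement.
    text.sub(pattern) { replacement }
  else
    text + "\n" + replacement + "\n"
  end
end
```

A possible use is to read the stock conf/hadoop-env.sh from the Hadoop tarball, pass it through set_java_home(text, "/usr/lib/jvm/java-6-openjdk"), and save the result as modules/hadoop/files/hadoop-env.sh.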

We can give different names to each host in the Vagrantfile. For that we replace its contents with the following:

Vagrant::Config.run do |config|
  config.vm.box = "base-hadoop"
  config.vm.provision :puppet do |puppet|
     puppet.manifests_path = "manifests"
     puppet.manifest_file  = "base-hadoop.pp"
     puppet.module_path = "modules"
  end
 
  config.vm.define :backup do |backup_config|
    backup_config.vm.network :hostonly, "192.168.1.11"
    backup_config.vm.host_name = "backup"
  end
 
  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "192.168.1.12"
    hadoop1_config.vm.host_name = "hadoop1"
  end
 
  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "192.168.1.13"
    hadoop2_config.vm.host_name = "hadoop2"
  end
 
  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "192.168.1.14"
    hadoop3_config.vm.host_name = "hadoop3"
  end

  config.vm.define :master do |master_config|
    master_config.vm.network :hostonly, "192.168.1.10"
    master_config.vm.host_name = "master"
  end

end
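
The host_name setting gives each guest its own hostname, while core-site.xml and mapred-site.xml refer to the name master. Whether every guest can resolve that name depends on the setup. Assuming static /etc/hosts entries are wanted on each guest, the lines could be rendered from the same names and addresses; NODES and hosts_entries below are hypothetical names used for this sketch only.

```ruby
# Names and addresses as used in the Vagrantfile of this article.
NODES = {
  "master"  => "192.168.1.10",
  "backup"  => "192.168.1.11",
  "hadoop1" => "192.168.1.12",
  "hadoop2" => "192.168.1.13",
  "hadoop3" => "192.168.1.14"
}

# Hypothetical helper: renders one "address<TAB>name" line per node,
# in the format used by /etc/hosts.
def hosts_entries(nodes)
  nodes.map { |name, ip| "#{ip}\t#{name}\n" }.join
end

puts hosts_entries(NODES)
```

The output could be appended to /etc/hosts on each guest by whatever means fits the setup, for example another Puppet file resource.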


Let’s do “vagrant reload” and wait for all systems to reload.

We have provisioned our systems. Let’s go to our master node and start everything:

vagrant ssh master

Then, once we are logged in, we go to /opt/hadoop-1.0.3/bin and run:

sudo ./hadoop namenode -format

sudo ./start-all.sh

We have now started our hadoop cluster. Now we can visit http://192.168.1.10:50070/ to access our master node and see that our hadoop cluster is indeed running.
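
The same check can be done from a script on the host instead of a browser. A minimal sketch using Ruby's standard net/http library, where web_ui_up? is a hypothetical helper and the address and port are the ones given above:

```ruby
require "net/http"

# Hypothetical helper: true if an HTTP server answers on host:port with
# a status below 500, false if the connection fails or times out.
def web_ui_up?(host, port, timeout = 3)
  http = Net::HTTP.new(host, port, nil) # nil: ignore any proxy from the environment
  http.open_timeout = timeout
  http.read_timeout = timeout
  http.start { |conn| conn.head("/").code.to_i < 500 }
rescue StandardError
  false
end
```

For instance, web_ui_up?("192.168.1.10", 50070) should return true once the NameNode web interface is reachable from the host machine.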

All the files for this example (except for the box itself) are available at git@github.com:calo81/vagrant-hadoop-cluster.git for free use.
Published at DZone with permission of Carlo Scarioni, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)