
Brian has 10+ years of experience as a technology leader and architect in a wide variety of settings, from early startups to Fortune 500 companies. With experience delivering SaaS solutions in business intelligence, artificial intelligence, and VoIP, his current focus is big data and analytics. Brian leads the Virgil project on Apache Extras, a services layer built on Cassandra that provides REST, Map/Reduce, Search, and Distributed Processing capabilities.

Programmatically submitting jobs to a remote Hadoop Cluster

12.28.2011

I'm adding the ability to deploy a Map/Reduce job to a remote Hadoop cluster in Virgil. With this, users can schedule a Hadoop job simply by making a REST POST to Virgil. (Pretty handy.)
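To give a sense of the interface, a scheduling request might look something like the sketch below. The endpoint path and parameters are hypothetical placeholders for illustration, not Virgil's actual REST contract:

import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical client call: the resource path and query parameters
// below are illustrative only, not Virgil's actual REST API.
public class ScheduleJobExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/virgil/job"
                + "?jarFile=/tmp/wordcount.jar&className=org.example.WordCount");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}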

To get this to work properly, Virgil needed to be able to deploy a job remotely. Ordinarily, to run a job against a remote cluster, you issue a command from the shell:

hadoop jar $JAR_FILE $CLASS_NAME 


We wanted to do the same thing, but from within the Virgil runtime. It was easy enough to find the class we needed: RunJar. RunJar's main() method stages the jar and submits the job, so to get the same behavior as the command line we used the following:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.RunJar;

List<String> args = new ArrayList<String>();
args.add(locationOfJarFile);  // path to the job jar
args.add(className);          // main class inside the jar
RunJar.main(args.toArray(new String[0]));

 
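One wrinkle: RunJar.main() declares throws Throwable, so any real caller has to wrap the call. A minimal sketch (the helper class is ours, not part of Virgil or Hadoop):

import org.apache.hadoop.util.RunJar;

// Hypothetical helper (our own, not Virgil's): submits a job jar
// exactly as "hadoop jar <jarFile> <mainClass>" would.
// RunJar.main() declares "throws Throwable", hence the wrapping.
public class JobSubmitter {
    public static void submit(String jarFile, String mainClass) {
        try {
            RunJar.main(new String[] { jarFile, mainClass });
        } catch (Throwable t) {
            throw new RuntimeException("Job submission failed", t);
        }
    }
}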
That worked just fine, but it resulted in a local job deployment. To get the job to deploy to a remote cluster, we needed Hadoop to load the cluster configuration, which is spread across three files: core-site.xml, hdfs-site.xml, and mapred-site.xml. To get the Hadoop runtime to load them, you need to include these files on your classpath. The key line is in the Javadoc for Hadoop's Configuration class:

"Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:"


Once we dropped the cluster configuration onto the classpath, everything worked like a charm.
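For completeness, dropping the files on the classpath is equivalent to registering them explicitly. A small sketch, assuming you prefer the explicit route (the bootstrap class is our own invention, but Configuration.addDefaultResource() is standard Hadoop):

import org.apache.hadoop.conf.Configuration;

// Register the cluster's site files as default resources so that every
// Configuration created afterwards picks them up from the classpath.
// core-site.xml is already loaded by default, per the Javadoc above.
public class ClusterConfigBootstrap {
    public static void register() {
        Configuration.addDefaultResource("hdfs-site.xml");
        Configuration.addDefaultResource("mapred-site.xml");
    }
}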

From http://brianoneill.blogspot.com/2011/12/programmatically-submitting-jobs-to.html

Published at DZone with permission of Brian O'Neill, author and DZone MVB.

