Big Data/Analytics Zone is brought to you in partnership with:

Bill Bejeck is a software developer and enjoys the challenges that software development brings. Bill also loves exploring new languages, technologies and learning and educating by blogging. Bill is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Configuring Hadoop with Guava MapSplitters

10.07.2013
| 4612 views |
  • submit to reddit


hadoop-logoIn this post, we are going to provide a new twist on passing configuration parameters to a Hadoop Mapper via the Context object. Typically, we set configuration parameters as key/value pairs on the Context object when starting a map-reduce job. Then in the Mapper, we use the key(s) to retrieve the value(s) to use for our configuration needs. The twist is that we will set a specially formatted string on the Context object, and when retrieving the value in the Mapper, use a Guava MapSplitter to convert the formatted string into a HashMap that will be used for obtaining configuration parameters. We may be asking ourselves, "why go to this trouble?" By doing configuration this way, we are able to pass multiple parameters to a Mapper with a single key-value pair set on the Context object. To illustrate one possible usage, we will revisit the last post, where we covered how to perform reduce-side joins. There are two problems with the proposed solution in that post. First, we are assuming the key to join on is always the first value in a delimited string from a file. Second, we assume that the same delimiter is used for each file. What if we want to join data from files where the key is located in different locations per file and some files use different delimiters? Additionally, we want to use the same delimiter (if any) for all of the data we output, regardless of the delimiter used in any of the input files. While this is admittedly a contrived situation, it will serve well for demonstration purposes. First let’s go over what the MapSplitter class is and how we can use it.

MapSplitter

The MapSplitter is a nested class in the Splitter class. The Spitter takes a String and splits it into parts with a given delimiter. The MapSplitter goes a step further by creating a Map<String,String> from a string with key-value pairs separated with one delimiter, and the pairs themselves separated with another delimiter altogether. Let’s take a look at an example:

Map<String,String> configParams = Splitter.splitOn("#")
	                                   .withKeyValueSeparator("=")
	                                   .split("6=June#7=July#8=August");

In the example above, the String "6=June#7=July#8=August" would be converted into a Map with keys 6, 7, and 8 mapped to June, July, and August respectively. The MapSplitter is a very simple, but powerful class. Now that we know how the MapSplitter works, let’s take a look at how we can use it to help us set configuration parameters for our map-reduce jobs.

Configuration Using the MapSplitter

Previously, we set the index position of our join key and the delimiter to split on as the same for all files by setting the values in the Context object for the map-reduce job. Now we want to be able to set those on a per input file basis as needed. We will still have default values as needed. In order to accomplish this change we will create a properties file that will have the name of the file as the key and the value will be a string formatted to be used by the MapSplitter. Our properties file will look like this:

oneToManyEmployer2.txt=keyIndex=1&separator=|
oneToManyVehicles2.txt=keyIndex=1&separator=#

Here we are indicating that the file oneToManyEmployer2.txt has our join key at index position 1 and the delimiter is a “|” pipe character, and the oneToManyVehicles2.txt file has the join key at index position 1 and uses a “,” comma as a delimiter. We are going to make a few changes to our driver class. First we are going to load the properties file (assuming we have placed the file in a directory located relative to where we invoke Hadoop).

InputStream inputStream = new FileInputStream(new File("./jobs/join-config.properties"));
Properties properties = new Properties();
properties.load(inputStream);

First we define a regular Splitter object that will split the file names on a forward slash ‘/’. Next, as we loop through the file names, we get the base name of the file by calling Iterables.getLast on the Iterable object returned from the Splitter.split method call. Then we attempt to retrieve a configured property string for each file in the Properties.getProperty method. Take note we also pass the defaultMapConfig variable that supplies default values if the property is not found for the file. We’ve also added some additional configuration keys and values; the delimiter to use when joining values together and the join order for the file, which is determined by it’s position in the arguments provided to the program. Then we simply place the formatted string into the Context object using the file name as the key.

String defaultMapConfig = "keyIndex=0&separator=,";
Splitter splitter = Splitter.on('/');
for (int i = 0; i < args.length - 1; i++) {
    String fileName = Iterables.getLast(splitter.split(args[i]));
    String mapConfig = properties.getProperty(fileName, defaultMapConfig);
    builder.append(mapConfig).append("&joinDelimiter=,&joinOrder=").append(i + 1);
    config.set(fileName, builder.toString());
    builder.setLength(0);
    filePaths.append(args[i]).append(",");
}

Using the Configuration Values

To use our config values we will first have to retrieve the HashMap stored as a String that contains our configuration parameters.

private Splitter.MapSplitter mapSplitter = Splitter.on("&").withKeyValueSeparator("=");
.......
private Map<String,String> getConfigurationMap(Context context){
        FileSplit fileSplit = (FileSplit)context.getInputSplit();
        String configString = context.getConfiguration().get(fileSplit.getPath().getName());
        return mapSplitter.split(configString);
    }

Here we are using a MapSplitter instance variable to create our HashMap by retrieving the formatted string stored in the Context with the file name used by this Mapper. Now we can simply pull the needed configuration parameters out from the map as shown here in the setup method:

protected void setup(Context context) throws IOException, InterruptedException {
        Map<String,String> configMap = getConfigurationMap(context);
        keyIndex = Integer.parseInt(configMap.get("keyIndex"));
        String separator = configMap.get("separator");
        splitter = Splitter.on(separator).trimResults();
        String joinDelimiter = configMap.get("joinDelimiter");
        joiner = Joiner.on(joinDelimiter);
        joinOrder = Integer.parseInt(configMap.get("joinOrder"));
    }

The code in the map method remains the same as it was from our previous post
Now we have completely configurable settings per file and we aren't restricted to having the join key in one position or having to use the same delimiter per file. Of course this is just one example, but the approach outlined here could be used to configure many other settings, and it only requires one key in the Context object.

Results

Initially, our data looked like this:

oneToManyEmployer2.txt:


Creative Wealth|cdd8dde3-0349-4f0d-b97a-7ae84b687f9c
Susie's Casuals|81a43486-07e1-4b92-b92b-03d0caa87b5f
Super Saver Foods|aef52cf1-f565-4124-bf18-47acdac47a0e
.....
oneToManyVehicles2.txt:

2003 Holden Cruze#cdd8dde3-0349-4f0d-b97a-7ae84b687f9c
2012 Volkswagen T5#81a43486-07e1-4b92-b92b-03d0caa87b5f
2009 Renault Trafic#aef52cf1-f565-4124-bf18-47acdac47a0e
.....
singlePersonRecords.txt:

cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI
81a43486-07e1-4b92-b92b-03d0caa87b5f,Timothy,Duncan,753 Stadium Drive,Taunton,MA
aef52cf1-f565-4124-bf18-47acdac47a0e,Brett,Ramsey,4985 Shinn Street,New York,NY
......

After running our map-reduce job, our results look exactly like we want them:

08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,2010 Nissan Titan,Ellman's Catalog Showrooms
0c521380-f868-438c-9916-4ab4ea76d316,Robert,Eversole,518 Stratford Court,Fayetteville,NC,2002 Toyota Highlander,Specialty Restaurant Group
1303e8a6-0085-45b1-8ea5-26c809635da1,Joe,Nagy,3438 Woodstock Drive,El Monte,CA,2011 Hyundai ix35,Eagle Food Centers
15360125-38d6-4f1e-a584-6ab9d1985ab8,Sherri,Hanks,4082 Old House Drive,Alexandria,OH,2003 Toyota Solara,Odyssey Records & Tapes
......

Resources

Published at DZone with permission of Bill Bejeck, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)