Simple analytics using MapReduce

Aggregate values (for example, mean, max, min, standard deviation, and so on) provide the basic analytics about a dataset. You may perform these calculations either for the whole dataset or for a part of it.

In this recipe, we will use Hadoop to calculate the minimum, maximum, and average size of a file downloaded from the NASA servers, by processing the NASA weblog dataset. The following figure shows a summary of the execution:

As shown in the figure, the mapper task emits all message sizes under the key msgSize, and they are all sent to a single reducer. The reducer then walks through all of the data and calculates the aggregate values.

Getting ready

  • This recipe assumes that you have followed the first chapter and have installed Hadoop. We will use HADOOP_HOME to refer to the Hadoop installation folder.

  • Start Hadoop by following the instructions in the first chapter.

  • This recipe assumes that you are aware of how Hadoop processing works. If you have not already done so, you should follow the recipe Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

The following steps describe how to use MapReduce to calculate simple analytics about the weblog dataset:

  1. Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and unzip it. We will call the extracted folder DATA_DIR.

  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:

    > bin/hadoop dfs -mkdir /data
    > bin/hadoop dfs -mkdir /data/input1
    > bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
    
  3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.

  4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.

  5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.

  6. Copy the build/lib/hadoop-cookbook-chapter6.jar to your HADOOP_HOME.

  7. Run the MapReduce job through the following command from HADOOP_HOME:

    > bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WebLogMessageSizeAggregator /data/input1 /data/output1
    
  8. Read the results by running the following command:

    > bin/hadoop dfs -cat /data/output1/*
    

    You will see that it prints the results as follows:

    Mean    1150
    Max     6823936
    Min     0
    

How it works...

You can find the source for this recipe in src/chapter6/WebLogMessageSizeAggregator.java.

HTTP logs follow a standard pattern, where each log line looks like the following. Here, the last token is the size of the web page retrieved:

205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985

We will use Java's regular expression support to parse the log lines; the Pattern.compile() call at the top of the class defines the regular expression. Since most Hadoop jobs involve text processing, regular expressions are a very useful tool when writing Hadoop jobs:

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException
{
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches())
  {
    int size = Integer.parseInt(matcher.group(5));
    context.write(new Text("msgSize"), new IntWritable(size));
  }
}
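The httplogPattern used above is defined at the top of the class and is not shown in this excerpt. The following standalone sketch uses a hypothetical pattern for such log lines (the exact expression in the chapter source may differ); group 5 captures the response size:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogPatternSketch {
    // Hypothetical pattern: groups 1-5 capture host, timestamp,
    // request, HTTP status, and response size, in that order.
    static final Pattern httplogPattern = Pattern.compile(
        "(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)");

    public static void main(String[] args) {
        String line = "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "
            + "\"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985";
        Matcher m = httplogPattern.matcher(line);
        if (m.matches()) {
            // Prints the size token of the sample line: 3985
            System.out.println(m.group(5));
        }
    }
}
```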

The map task receives each line in the log file as a different key-value pair. It parses the lines using regular expressions and emits the file size against the key msgSize.

Then, Hadoop collects all values for the key and invokes the reducer. The reducer walks through the values and calculates the minimum, maximum, and mean size of the files downloaded from the web server. It is worth noting that by making the values available as an iterator, Hadoop gives the programmer a chance to process the data without storing it in memory. You should therefore try to process values without storing them in memory whenever possible.

public static class AReducer
    extends Reducer<Text, IntWritable, Text, IntWritable>
{
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException
  {
    double tot = 0;
    int count = 0;
    int min = Integer.MAX_VALUE;
    int max = 0;
    Iterator<IntWritable> iterator = values.iterator();
    while (iterator.hasNext())
    {
      int value = iterator.next().get();
      tot = tot + value;
      count++;
      if (value < min)
      {
        min = value;
      }
      if (value > max)
      {
        max = value;
      }
    }
    context.write(new Text("Mean"),
        new IntWritable((int) (tot / count)));
    context.write(new Text("Max"),
        new IntWritable(max));
    context.write(new Text("Min"),
        new IntWritable(min));
  }
}
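Stripped of the Hadoop types, the same single-pass aggregation can be tried out in plain Java. The sample sizes below are made up for illustration; the point is that one pass over the iterator keeps only the running aggregates, never the whole list:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class StreamingAggregates {
    public static void main(String[] args) {
        // Hypothetical response sizes, standing in for the reducer's values
        List<Integer> sizes = Arrays.asList(3985, 0, 6823936, 1024);
        double tot = 0;
        int count = 0;
        int min = Integer.MAX_VALUE;
        int max = 0;
        // One pass: constant memory regardless of how many values arrive
        Iterator<Integer> it = sizes.iterator();
        while (it.hasNext()) {
            int v = it.next();
            tot += v;
            count++;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        System.out.println("Mean " + (int) (tot / count));
        System.out.println("Max " + max);
        System.out.println("Min " + min);
    }
}
```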

The main() method of the job looks similar to that of the WordCount example, except for the lines that have been changed to accommodate the different input and output datatypes:

Job job = new Job(conf, "LogProcessingMessageSizeAggregation");
job.setJarByClass(WebLogMessageSizeAggregator.class);
job.setMapperClass(AMapper.class);
job.setReducerClass(AReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

There's more...

You can learn more about Java regular expressions from the Java tutorial, http://docs.oracle.com/javase/tutorial/essential/regex/.