Hadoop MapReduce Cookbook

Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface

Getting Hadoop Up and Running in a Cluster
    Introduction
    Setting up Hadoop on your machine
    Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
    Adding the combiner step to the WordCount MapReduce program
    Setting up HDFS
    Using the HDFS monitoring UI
    HDFS basic command-line file operations
    Setting up Hadoop in a distributed cluster environment
    Running the WordCount program in a distributed cluster environment
    Using the MapReduce monitoring UI

Advanced HDFS
    Introduction
    Benchmarking HDFS
    Adding a new DataNode
    Decommissioning DataNodes
    Using multiple disks/volumes and limiting HDFS disk usage
    Setting the HDFS block size
    Setting the file replication factor
    Using the HDFS Java API
    Using the HDFS C API (libhdfs)
    Mounting HDFS (Fuse-DFS)
    Merging files in HDFS

Advanced Hadoop MapReduce Administration
    Introduction
    Tuning Hadoop configurations for cluster deployments
    Running benchmarks to verify the Hadoop installation
    Reusing Java VMs to improve performance
    Fault tolerance and speculative execution
    Debug scripts – analyzing task failures
    Setting failure percentages and skipping bad records
    Shared-user Hadoop clusters – using the Fair Scheduler and other schedulers
    Hadoop security – integrating with Kerberos
    Using the Hadoop Tool interface

Developing Complex Hadoop MapReduce Applications
    Introduction
    Choosing appropriate Hadoop data types
    Implementing a custom Hadoop Writable data type
    Implementing a custom Hadoop key type
    Emitting data of different value types from a mapper
    Choosing a suitable Hadoop InputFormat for your input data format
    Adding support for new input data formats – implementing a custom InputFormat
    Formatting the results of MapReduce computations – using Hadoop OutputFormats
    Hadoop intermediate (map-to-reduce) data partitioning
    Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
    Using Hadoop with legacy applications – Hadoop Streaming
    Adding dependencies between MapReduce jobs
    Hadoop counters for reporting custom metrics

Hadoop Ecosystem
    Introduction
    Installing HBase
    Random access to data using the Java client APIs
    Running MapReduce jobs on HBase (table input/output)
    Installing Pig
    Running your first Pig command
    Set operations (join, union) and sorting with Pig
    Installing Hive
    Running a SQL-style query with Hive
    Performing a join with Hive
    Installing Mahout
    Running K-means with Mahout
    Visualizing K-means results

Analytics
    Introduction
    Simple analytics using MapReduce
    Performing Group-By using MapReduce
    Calculating frequency distributions and sorting using MapReduce
    Plotting the Hadoop results using gnuplot
    Calculating histograms using MapReduce
    Calculating scatter plots using MapReduce
    Parsing a complex dataset with Hadoop
    Joining two datasets using MapReduce

Searching and Indexing
    Introduction
    Generating an inverted index using Hadoop MapReduce
    Intra-domain web crawling using Apache Nutch
    Indexing and searching web documents using Apache Solr
    Configuring Apache HBase as the backend data store for Apache Nutch
    Deploying Apache HBase on a Hadoop cluster
    Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
    ElasticSearch for indexing and searching
    Generating the in-links graph for crawled web pages

Classifications, Recommendations, and Finding Relationships
    Introduction
    Content-based recommendations
    Hierarchical clustering
    Clustering an Amazon sales dataset
    Collaborative filtering-based recommendations
    Classification using the Naive Bayes classifier
    Assigning advertisements to keywords using the AdWords balance algorithm

Mass Text Data Processing
    Introduction
    Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
    Data de-duplication using Hadoop Streaming
    Loading large datasets into an Apache HBase data store using the importtsv and bulkload tools
    Creating TF and TF-IDF vectors for the text data
    Clustering the text data
    Topic discovery using Latent Dirichlet Allocation (LDA)
    Document classification using the Mahout Naive Bayes classifier

Cloud Deployments: Using Hadoop on Clouds
    Introduction
    Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
    Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
    Executing a Pig script using EMR
    Executing a Hive script using EMR
    Creating an Amazon EMR job flow using the Command Line Interface
    Deploying an Apache HBase cluster on Amazon EC2 using EMR
    Using EMR bootstrap actions to configure VMs for Amazon EMR jobs
    Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
    Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment

Index