PacktLib: Web Crawling and Data Mining with Apache Nutch

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Getting Started with Apache Nutch

Introduction to Apache Nutch

Installing and configuring Apache Nutch

Crawling your website using the crawl script

Crawling the Web, the CrawlDb, and URL filters

Parsing and parse filters

The Apache Nutch plugin

Understanding the Nutch Plugin architecture

Summary

Deployment, Sharding, and AJAX Solr with Apache Nutch

Deployment of Apache Solr

Sharding using Apache Solr

Working with AJAX Solr

Summary

Integration of Apache Nutch with Apache Hadoop and Eclipse

Integrating Apache Nutch with Apache Hadoop

Configuring Apache Nutch with Eclipse

Summary

Apache Nutch with Gora, Accumulo, and MySQL

Introduction to Apache Accumulo

Introduction to Apache Gora

Use of Apache Gora

Integration of Apache Nutch with Apache Accumulo

Integration of Apache Nutch with MySQL

Summary

Index