Nutch 2.2.1 setup with HBase on hadoop cluster

Question

I followed this tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) to set up Nutch 2.2.1 with HBase. I completed the setup as described, but the tutorial does not clearly explain how to crawl and store the data in HBase tables.

Can you please point me to some relevant links or books on this?

Babu · Accepted Answer

Most helpful for me was this:

http://sujitpal.blogspot.cz/2011/01/exploring-nutch-20-hbase-storage.html

The mapping to HBase is defined in NUTCH_HOME/conf/gora-hbase-mapping.xml. So if everything is configured correctly, the crawl script should store the data for you.

I have the same configuration and had many problems getting it to work, so here are some tips:

Tip 1: be careful with the table name

I also configure these properties:

<property>
  <name>storage.schema.webpage</name>
  <value>webpage</value>
</property>

<property>
  <name>storage.crawl.id</name>
  <value>babu</value>
</property>

With this configuration, data is crawled into the babu_webpage table in HBase, provided the -crawlId argument in the script is given as plain 'babu' (that is, $CRAWL_ID is set to 'babu'):

$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50
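To make the naming rule concrete, here is a minimal sketch of how the table name appears to be built: the crawl ID, an underscore, then the schema name, as in the babu_webpage example above. The helper `expected_table` is hypothetical and not part of Nutch, and the empty-crawl-ID case (plain schema name) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints the HBase table name
# that a given crawl ID and schema name are expected to map to.
# Assumption: the name is "<crawlId>_<schema>", or just "<schema>"
# when no crawl ID is given.
expected_table() {
  crawl_id="$1"
  schema="${2:-webpage}"   # default mirrors storage.schema.webpage above
  if [ -n "$crawl_id" ]; then
    printf '%s_%s\n' "$crawl_id" "$schema"
  else
    printf '%s\n' "$schema"
  fi
}

# With the values from the properties above:
expected_table babu webpage
```

Comparing this name with what `list` shows in the HBase shell is a quick way to catch a mismatch between storage.crawl.id and the -crawlId argument.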

Tip 2: if the table name is wrong, Nutch still reports success on the console.

Tip 3: a simple way to check whether anything was crawled into HBase:

Start ./bin/hbase shell and run:

list
scan 'babu_webpage'
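Because of Tip 2, it can help to script this check rather than trust the console output. The following is only a sketch: `has_table` is a hypothetical helper that reads the text printed by `list` on standard input, and it assumes each table name appears on its own line in that output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the table name given as $1 appears
# as a whole line in the `list` output read from stdin.
has_table() {
  # Strip trailing whitespace, then match the name as a fixed, full line.
  sed 's/[[:space:]]*$//' | grep -Fxq -- "$1"
}

# Demonstration with a sample listing (no HBase needed):
printf 'TABLE\nbabu_webpage\n' | has_table babu_webpage && echo "table found"

# Against a real installation (assumes ./bin/hbase is reachable from the
# current directory and that the shell accepts commands on stdin):
#   echo "list" | ./bin/hbase shell 2>/dev/null | has_table babu_webpage
```

If the table is missing after a crawl that reported success, the crawl ID or schema name is the first thing to re-check.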
