Nutch 2.2.1 setup with HBase on a Hadoop cluster

I followed this tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) to set up Nutch 2.2.1 with HBase. I have completed the setup as described in the tutorial, but it does not clearly explain how to crawl and store the data in HBase tables.

Can you please point me to some relevant links or books on this?

Rahul Katare Avatar asked Dec 03 '25 08:12



1 Answer

The most helpful resource for me was this:

http://sujitpal.blogspot.cz/2011/01/exploring-nutch-20-hbase-storage.html

The mapping to HBase is defined in NUTCH_HOME/conf/gora-hbase-mapping.xml, so if everything is configured correctly, the crawl script should store the crawled data for you.

I have the same configuration and had many problems getting it to work. Here are some tips:

Tip 1: be careful about the table name

I also configure these properties:

<property>
  <name>storage.schema.webpage</name>
  <value>webpage</value>
</property>

<property>
  <name>storage.crawl.id</name>
  <value>babu</value>
</property>

With this configuration, the data is crawled into the babu_webpage table in HBase. When you pass the -crawlId argument in the script, use just 'babu' as the value of $CRAWL_ID, not the full table name:

$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50
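
The naming rule from Tip 1 can be sketched in shell. This is only an illustration: `table_name` is a hypothetical helper, not part of Nutch, and the fallback to the bare schema name when no crawl ID is given is an assumption about how Nutch 2.x behaves.

```shell
# Hypothetical helper (not part of Nutch): derive the HBase table name
# from a crawl ID and the value of storage.schema.webpage.
table_name() {
  crawl_id="$1"
  schema="${2:-webpage}"   # default mirrors storage.schema.webpage above
  if [ -n "$crawl_id" ]; then
    # The crawl ID is used as a prefix, joined with an underscore.
    printf '%s_%s\n' "$crawl_id" "$schema"
  else
    # Assumption: without a crawl ID the table is just the schema name.
    printf '%s\n' "$schema"
  fi
}

CRAWL_ID=babu
table_name "$CRAWL_ID"   # prints babu_webpage
```

Checking the expected name this way before a crawl makes it easier to spot a mismatch between -crawlId and the table you later scan.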

Tip 2: if the table name is wrong, Nutch still reports success on the console.

Tip 3: a simple way to check whether anything has been crawled into HBase:

Start the shell with ./bin/hbase shell, then run:

list
scan 'babu_webpage'
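
To make this check repeatable, the two commands can be generated by a small function. A minimal sketch, assuming the babu_webpage table from Tip 1: `hbase_check_cmds` is a hypothetical helper, and the default LIMIT of 5 rows is an arbitrary choice to keep the scan output short.

```shell
# Hypothetical helper: print the HBase shell commands used in Tip 3.
# $1 = table name, $2 = optional row limit for the scan (default 5).
hbase_check_cmds() {
  table="$1"
  limit="${2:-5}"
  printf "list\n"
  printf "scan '%s', {LIMIT => %s}\n" "$table" "$limit"
}

# Only prints the commands; nothing is sent to HBase here.
hbase_check_cmds babu_webpage
```

Piping the output into the shell, for example `hbase_check_cmds babu_webpage | ./bin/hbase shell`, should run both commands without an interactive session, assuming your HBase version reads commands from standard input.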
Babu Avatar answered Dec 06 '25 06:12



