I followed this tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) to set up Nutch 2.2.1 with HBase. I have completed the setup as described there, but the tutorial does not clearly explain how to crawl and store the data in HBase tables.
Can you point me to some relevant links or books on this?
The most helpful resource for me was this:
http://sujitpal.blogspot.cz/2011/01/exploring-nutch-20-hbase-storage.html
The mapping to HBase is defined in NUTCH_HOME/conf/gora-hbase-mapping.xml, so if everything is configured correctly, the crawl script should store the data for you.
I have the same configuration and had many problems getting it to work. Here are some tips:
Tip 1: be careful about the table name
I also configure these properties:
<property>
  <name>storage.schema.webpage</name>
  <value>webpage</value>
</property>
<property>
  <name>storage.crawl.id</name>
  <value>babu</value>
</property>
With this configuration, Nutch stores the crawl data in the babu_webpage table in HBase. When the script passes the -crawlId argument, simply use 'babu' as the value of $CRAWL_ID:
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId $CRAWL_ID -threads 50
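Because the table name depends on both settings, it can help to print the name you expect before running the crawl. The following is only a minimal sketch: the function name nutch_table_name is made up for illustration, and it assumes that Nutch joins a non-empty crawl ID and the schema name with an underscore (which matches the babu_webpage example above) and falls back to the bare schema name when no crawl ID is given.

```shell
# Hypothetical helper, not part of Nutch: derive the expected HBase table name.
# Assumption: a non-empty crawl ID is prefixed to the schema name with "_";
# with an empty crawl ID, the bare schema name is used.
nutch_table_name() {
  crawl_id="$1"
  schema="${2:-webpage}"   # defaults to the storage.schema.webpage value above
  if [ -n "$crawl_id" ]; then
    printf '%s_%s\n' "$crawl_id" "$schema"
  else
    printf '%s\n' "$schema"
  fi
}

CRAWL_ID="babu"
echo "Expecting crawl data in table: $(nutch_table_name "$CRAWL_ID")"
```

If the name printed here is not the table you later see in HBase, the crawl ID or schema setting is probably not what you think it is.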
Tip 2: if the table name is wrong, Nutch still reports success on the console.
Tip 3: a simple way to check whether anything has been crawled into HBase
Start the HBase shell with ./bin/hbase shell and run:
list
scan 'babu_webpage'
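Since Nutch reports success even with a wrong table name, the same check can be scripted instead of typed by hand. This is a rough sketch under a few assumptions: has_table is a made-up helper name, the HBase shell is assumed to accept commands on standard input, and the output of list is assumed to contain the table name as a separate word.

```shell
# Hypothetical helper, not part of HBase or Nutch: read text on stdin and
# succeed only if the given table name appears in it as a whole word.
has_table() {
  grep -qwF -- "$1"
}

# Usage sketch (assumes it is run from the HBase install directory):
#   echo "list" | ./bin/hbase shell | has_table babu_webpage \
#     && echo "table exists" || echo "table missing"
```

The whole-word match matters here: searching for plain webpage should not succeed just because babu_webpage exists.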