
Where is the crawled data stored when running nutch crawler?

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.

I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr since I may require to search text in future) and ran the crawl using a few URLs as the seed.

Now, I don't find the text/html data in my local machine. Where can I find the data and what is the best way to read the data in text format?

Versions

  • apache-nutch-1.9
  • solr-4.10.4
asked Mar 30 '15 by Marco99


People also ask

What is a nutch crawl?

Nutch is a highly extensible, highly scalable, mature, production-ready web crawler which enables fine-grained configuration and accommodates a wide variety of data acquisition tasks.

How does Apache Nutch work?

Nutch takes the injected URLs, stores them in the CrawlDB, and uses those links to go out to the web and scrape each URL. Then, it parses the scraped data into various fields and pushes any scraped hyperlinks back into the CrawlDB.
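The cycle described above maps onto a handful of `bin/nutch` subcommands. The sketch below is a dry run of that cycle for Nutch 1.x: each step is echoed rather than executed, since Nutch must be installed and the directory names (`crawl/crawldb`, `urls/`, the timestamped segment) are assumptions for illustration.

```shell
# Dry-run sketch of the Nutch 1.x crawl cycle.
# 'run' only echoes each command; drop the echo to execute for real.
run() { echo "+ $*"; }

run bin/nutch inject crawl/crawldb urls               # seed URLs -> CrawlDB
run bin/nutch generate crawl/crawldb crawl/segments   # select URLs to fetch
SEGMENT=crawl/segments/20150330101500                 # hypothetical timestamped dir
run bin/nutch fetch "$SEGMENT"                        # download the pages
run bin/nutch parse "$SEGMENT"                        # extract text and outlinks
run bin/nutch updatedb crawl/crawldb "$SEGMENT"       # push outlinks back to CrawlDB
```

Repeating generate/fetch/parse/updatedb deepens the crawl by one link hop per round.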

Is Apache Nutch open source?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

What is Nutch in Solr?

Nutch is an open source crawler that provides a Java library for crawling, indexing, and database storage. Solr is an open source search platform that provides full-text search and integrates with Nutch, so pages crawled by Nutch can be indexed and queried through Solr.


1 Answer

After your crawl is over, you can use the bin/nutch dump command to dump all the fetched pages in plain HTML format.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
   [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

So, for example, you could run:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir location and dumps all the crawled pages in HTML format.
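Once the pages are dumped, a quick way to read them as text is to strip the markup. The snippet below is a crude approximation, not a real HTML parser (it ignores entities, scripts, etc.), and it assumes the dumped files sit under the crawl/dump directory used above:

```shell
# Rough text extraction from dumped HTML: replace tags with spaces,
# then squeeze runs of whitespace. Good enough for eyeballing content.
html_to_text() {
  sed -e 's/<[^>]*>/ /g' "$1" | tr -s '[:space:]' ' '
}

for f in crawl/dump/*.html; do
  [ -e "$f" ] || continue      # skip cleanly if no dump files exist yet
  printf '== %s ==\n' "$f"
  html_to_text "$f"
  echo
done
```

For real analysis you would want a proper HTML parser (e.g. jsoup or Tika, which Nutch itself uses for parsing), but this is handy for a first look.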

There are many more ways of dumping specific data out of Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions

answered Sep 21 '22 by Sujen Shah