I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.
I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as seeds.
Now I can't find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?
Nutch is a highly extensible, highly scalable, mature, production-ready web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks.
Nutch takes the injected URLs, stores them in the CrawlDB, and uses those links to go out to the web and scrape each URL. Then, it parses the scraped data into various fields and pushes any scraped hyperlinks back into the CrawlDB.
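To make that loop concrete, here is a toy sketch in Python. It is not Nutch code and does not use any Nutch API; `FAKE_WEB`, `crawl`, and the example URLs are all made up for illustration, and the "web" is just an in-memory dictionary so the snippet runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("Page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Page B", ["http://a.example/"]),
    "http://c.example/": ("Page C", []),
}

def crawl(seeds, max_pages=100):
    """Toy version of the inject -> fetch -> parse -> update cycle."""
    crawl_db = set(seeds)      # every URL known so far (stand-in for the CrawlDB)
    queue = deque(seeds)       # URLs still waiting to be fetched
    parsed = {}                # URL -> extracted text
    while queue and len(parsed) < max_pages:
        url = queue.popleft()
        if url not in FAKE_WEB:
            continue           # pretend the fetch failed
        text, links = FAKE_WEB[url]   # "fetch" and "parse" in one step
        parsed[url] = text
        for link in links:
            if link not in crawl_db:  # push newly discovered links back
                crawl_db.add(link)
                queue.append(link)
    return crawl_db, parsed

if __name__ == "__main__":
    db, pages = crawl(["http://a.example/"])
    print(sorted(db))
    print(pages)
```

Real Nutch runs these phases as separate batch steps over segments rather than one page at a time, so treat this only as a mental model of how seed URLs grow into a larger set of known URLs.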
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch is an open source crawler that provides a Java library for crawling, indexing, and database storage. Solr is an open source search platform that provides full-text search and integrates with Nutch.
After your crawl is over, you can use the bin/nutch dump command to dump all the fetched pages as plain HTML files.
The usage is as follows:
$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>] [-segment <segment>]
  -h,--help                 show this help message
  -mimetype <mimetype>      an optional list of mimetypes to dump, excluding all others. Defaults to all.
  -outputDir <outputDir>    output directory (which will be created) to host the raw data
  -segment <segment>        the segment(s) to use
So, for example, you could run something like:
$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
This creates a new directory at the -outputDir location and dumps all the crawled pages there in HTML format.
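Since you asked about reading the data as text: once the HTML files are on disk, you could strip the markup yourself. Below is a minimal sketch using only the Python standard library. The names `TextExtractor`, `html_to_text`, and `dump_dir_to_text` are my own, and I am assuming the dumped files end in `.html` or `.htm` and can be read as UTF-8; check what your dump directory actually contains and adjust the filter.

```python
import os
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script>/<style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)

def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

def dump_dir_to_text(dump_dir):
    """Walk dump_dir and map each HTML file path to its extracted text."""
    results = {}
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            # Assumption: dumped pages carry an .html/.htm extension.
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                results[path] = html_to_text(fh.read())
    return results

if __name__ == "__main__":
    for path, text in dump_dir_to_text("crawl/dump").items():
        print(path, len(text))
```

This is deliberately crude (no boilerplate removal, no charset detection), but it is usually enough to get started with simple text analysis on a few hundred pages.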
There are many more ways to dump specific data from Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions