Where is the crawled data stored when running the Nutch crawler?

Versions

apache-nutch-1.9
solr-4.10.4

asked Mar 30 '15 09:03

Marco99

1 Answer

After your crawl finishes, you can use the bin/nutch dump command to write the raw content of every fetched URL (plain HTML for web pages) to a directory.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
                 [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

For example, you could run:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir location and dumps all the crawled pages into it in HTML format.
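Before dumping, it can help to confirm that the crawl actually fetched something. The sketch below assumes the crawl directory is ./crawl, as in the example above, and that each segment keeps its raw fetched pages in a content subdirectory. segments_with_content is a hypothetical helper written for this illustration, not a Nutch command.

```shell
#!/bin/sh
# Sketch: list the segments that hold fetched content, then run the dump.
# Assumptions: the crawl lives in ./crawl and each segment stores raw
# fetched pages under <segment>/content.

# Hypothetical helper (not part of Nutch): print every segment directory
# under the given root that has a "content" subdirectory.
segments_with_content() {
    root="$1"
    for seg in "$root"/*/; do
        # If the glob matched nothing, or the segment was generated but
        # never fetched, this test fails and the entry is skipped.
        [ -d "${seg}content" ] && printf '%s\n' "${seg%/}"
    done
    return 0
}

# Run the dump only when Nutch is present and at least one segment was fetched.
if [ -x bin/nutch ] && [ -n "$(segments_with_content crawl/segments)" ]; then
    bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
    # Count the files written to the output directory.
    find crawl/dump -type f | wc -l
fi
```

If the helper prints nothing, no segment has a content directory yet, so there would be nothing for the dump to write.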

There are many more ways to dump specific data from Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions

answered Sep 21 '22 00:09

Sujen Shah
