
Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode [closed]

We need to crawl a large number of web pages (~1.5 billion) every two weeks. Speed, and therefore cost, is a huge factor for us, as our initial attempts have already cost us over $20k.

Is there any data on which crawler performs the best in a distributed environment?

Asked Dec 18 '22 by Anakin

2 Answers

We've only tried Nutch, StormCrawler and Mixnode. We eventually used Mixnode to crawl ~300 million pages across 5k domains.

My $0.02: Mixnode is the better choice for larger-scale crawling (i.e., over 1 million URLs). For smaller crawls it's overkill, since you would have to parse the resulting WARC files; if you're only doing a few thousand pages, it's easier to run your own script or use an open-source alternative like Nutch or StormCrawler (or even Scrapy).
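
On the "parse the resulting warc files" point: here is a minimal sketch using the Python warcio library, assuming the crawl output is a gzipped WARC file (the filename below is a placeholder). It just walks the archive and pulls out the URL, HTTP status and body of each response record; a real pipeline would feed these into whatever parsing or indexing you need.

```python
from warcio.archiveiterator import ArchiveIterator

# Placeholder filename: point this at whatever WARC file your crawler produced.
with open('crawl-output.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only look at HTTP responses; WARC files also contain request,
        # metadata and warcinfo records.
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        status = record.http_headers.get_statuscode() if record.http_headers else None
        body = record.content_stream().read()  # raw response bytes
        print(url, status, len(body))
```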

Update: Mixnode has since repositioned itself as an "alternative" to web crawling, so it is now a completely different product from the one described in my original answer.

Answered Apr 18 '23 by Sunil Kumbhar


For a comparison between Nutch and StormCrawler, see my article on DZone.

Heritrix can be used in distributed mode, but the documentation is not very clear on how to do this. Nutch and StormCrawler rely on well-established platforms for distributing the computation (Apache Hadoop and Apache Storm, respectively), whereas Heritrix does not.

Heritrix is also used mostly by the web-archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for extracting data.

I am not familiar with the two hosted services you mention, as I only use open-source software.

Answered Apr 18 '23 by Julien Nioche