We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, and hence cost, is a huge factor for us, as our initial attempts have already cost us over $20k.
Is there any data on which crawler performs the best in a distributed environment?
We've only tried Nutch, StormCrawler, and Mixnode. We eventually used Mixnode to crawl ~300 million pages across 5k domains.
My $0.02: Mixnode is the better choice for larger-scale crawling (i.e. over 1 million URLs). For smaller crawls it's overkill, since you would have to parse the resulting WARC files; if you're only doing a few thousand pages, it's easier to run your own script or use an open-source alternative like Nutch or StormCrawler (or even Scrapy).
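To give a sense of the WARC-parsing overhead, here is a minimal sketch in Python using the warcio library (one of several WARC readers, not something tied to any particular crawler). The file name and the helper name are made up for illustration.

```python
# Minimal sketch: iterate over the response records in a (gzipped) WARC file
# using the warcio library (pip install warcio). The file name below is a
# placeholder; point it at whatever your crawl actually produced.
from warcio.archiveiterator import ArchiveIterator

def iter_responses(path):
    """Yield (url, payload_bytes) for each HTTP response record in the WARC."""
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')
                payload = record.content_stream().read()
                yield url, payload

if __name__ == '__main__':
    # Hypothetical output file, for illustration only.
    for url, payload in iter_responses('crawl-output.warc.gz'):
        print(url, len(payload))
```

Even this small amount of glue code is extra work compared to a tool that hands you parsed pages directly, which is why I'd reserve the WARC route for crawls where the scale justifies it.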
Update: Mixnode has since repositioned itself as an "alternative" to web crawling, so it's a completely different product from the one described in my old answer.
For a comparison between Nutch and StormCrawler, see my article on DZone.
Heritrix can be used in distributed mode, but the documentation is not very clear on how to do this. Nutch and StormCrawler rely on well-established platforms for distributing the computation (Apache Hadoop and Apache Storm, respectively), whereas Heritrix does not.
Heritrix is also used mostly by the archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for extracting data.
I am not familiar with the two hosted services you mention, as I only use open-source software.