We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, and hence cost, is a huge factor for us, as our initial attempts have already cost us over $20k.
Is there any data on which crawler performs the best in a distributed environment?
We've only tried Nutch, StormCrawler, and Mixnode. We eventually used Mixnode to crawl ~300 million pages across 5k domains.
My $0.02: Mixnode is the better choice for larger-scale crawling (i.e. over 1 million URLs). For smaller crawls it's overkill, since you would have to parse the resulting WARC files; if you're only doing a few thousand pages, it's easier to run your own script or use an open-source alternative like Nutch or StormCrawler (or even Scrapy).
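To give a sense of the WARC-parsing overhead, here is a minimal sketch in Python using the warcio library (one of several WARC readers, not something tied to any particular crawler). The file name and the helper name are made up for illustration.

```python
# Minimal sketch: iterate over the response records in a (gzipped) WARC file
# using the warcio library (pip install warcio). The file name below is a
# placeholder; point it at whatever your crawl actually produced.
from warcio.archiveiterator import ArchiveIterator

def iter_responses(path):
    """Yield (url, payload_bytes) for each HTTP response record in the WARC."""
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')
                payload = record.content_stream().read()
                yield url, payload

if __name__ == '__main__':
    # Hypothetical output file, for illustration only.
    for url, payload in iter_responses('crawl-output.warc.gz'):
        print(url, len(payload))
```

Even this small amount of glue code is extra work compared to a tool that hands you parsed pages directly, which is why I'd reserve the WARC route for crawls where the scale justifies it.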
Update: Mixnode has since repositioned itself as an "alternative" to web crawling, so it's a completely different product from the one described in my old answer.
For a comparison between Nutch and StormCrawler, see my article on DZone.
Heritrix can be used in distributed mode, but the documentation is not very clear on how to do this. Nutch and StormCrawler rely on well-established platforms for distributing the computation (Apache Hadoop and Apache Storm, respectively), whereas Heritrix does not.
Heritrix is also used mostly by the archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for extracting data.
I am not familiar with the two hosted services you mention, as I only use open-source software.