
Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project.

This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pages.
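For context, request parallelism in Scrapy is controlled by a handful of settings; below is a rough sketch of the relevant knobs (the values are illustrative, not what I actually ran with):

    # settings.py -- the standard Scrapy knobs that govern crawl speed
    CONCURRENT_REQUESTS = 100            # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-domain cap; the binding limit when crawling one domain
    DOWNLOAD_DELAY = 0                   # no artificial pause between requests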

My initial notion that Scrapy isn't meant for this type of crawl is proving correct.

I've set my sights on Nutch and Methabot in hopes of better performance. The only data I need to store during the crawl is the full content of each web page and, preferably, all the links on the page (though even that can be done in post-processing, as sketched below).
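For example, the post-processing link extraction could be as simple as this (lxml is my assumption here; nothing in the crawl itself depends on it):

    # Post-processing sketch: pull every link out of a saved page with lxml
    from lxml import html

    def extract_links(page_source, base_url):
        """Return absolute URLs for every link found in the HTML."""
        tree = html.fromstring(page_source)
        tree.make_links_absolute(base_url)  # resolve relative hrefs against the page URL
        return [link for _, _, link, _ in tree.iterlinks()]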

I'm looking for a crawler that is fast and employs many parallel requests.
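To make "many parallel requests" concrete, here is a rough standard-library sketch of the fetching pattern I mean (ThreadPoolExecutor and the worker count are illustrative choices, not anything a particular crawler does internally):

    # Sketch of parallel fetching with a thread pool; a real crawler adds
    # queueing, deduplication, and error handling on top of this.
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    def fetch_all(urls, workers=50):
        """Issue up to `workers` requests in parallel, yielding (url, body)."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            yield from pool.map(fetch, urls)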

asked Aug 06 '10 by OregonTrail

1 Answer

This may be the fault of the server, not Scrapy. The server may not be as fast as you'd like, or it (or the webmaster) may detect crawling and throttle the speed for your connection/cookie. Do you use a proxy? That can slow crawling down too. It may also be Scrapy being wise: if you crawl too intensively, you may get banned from the server.

For my hand-written C++ crawler I artificially set a limit of 1 request per second. Even that speed is enough for a single thread (1 req × 60 secs × 60 minutes × 24 hours = 86,400 req/day). If you're interested, you can email whalebot.helmsman {AT} gmail.com.
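As a sketch of that 1-request-per-second throttle (in Python rather than the original C++, which isn't shown here):

    import time
    import urllib.request

    REQUEST_INTERVAL = 1.0  # one request per second ~= 86,400 requests/day

    def polite_fetch(urls):
        """Fetch URLs sequentially, never exceeding 1 request per second."""
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            elapsed = time.monotonic() - start
            if elapsed < REQUEST_INTERVAL:
                time.sleep(REQUEST_INTERVAL - elapsed)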

answered Sep 21 '22 by whalebot.helmsman