
Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project.

This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pages.
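For context, request parallelism in Scrapy is controlled by a handful of settings; below is a rough sketch of the relevant knobs (the values are illustrative, not what I actually ran with):

    # settings.py -- the standard Scrapy knobs that govern crawl speed
    CONCURRENT_REQUESTS = 100            # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-domain cap; the binding limit when crawling one domain
    DOWNLOAD_DELAY = 0                   # no artificial pause between requests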

My initial notion that Scrapy isn't meant for this type of crawl is proving correct.

I've set my sights on Nutch and Methabot in hopes of better performance. The only data I need to store during the crawl is the full content of each web page and, preferably, all the links on the page (though even that can be done in post-processing, as sketched below).
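For example, the post-processing link extraction could be as simple as this (lxml is my assumption here; nothing in the crawl itself depends on it):

    # Post-processing sketch: pull every link out of a saved page with lxml
    from lxml import html

    def extract_links(page_source, base_url):
        """Return absolute URLs for every link found in the HTML."""
        tree = html.fromstring(page_source)
        tree.make_links_absolute(base_url)  # resolve relative hrefs against the page URL
        return [link for _, _, link, _ in tree.iterlinks()]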

I'm looking for a crawler that is fast and employs many parallel requests.
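To make "many parallel requests" concrete, here is a rough standard-library sketch of the fetching pattern I mean (ThreadPoolExecutor and the worker count are illustrative choices, not anything a particular crawler does internally):

    # Sketch of parallel fetching with a thread pool; a real crawler adds
    # queueing, deduplication, and error handling on top of this.
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    def fetch_all(urls, workers=50):
        """Issue up to `workers` requests in parallel, yielding (url, body)."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            yield from pool.map(fetch, urls)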

asked Aug 06 '10 by OregonTrail

1 Answer

This may be the fault of the server, not Scrapy. The server may not be as fast as you'd like, or it (or the webmaster) may detect crawling and throttle the speed for your connection/cookie. Do you use a proxy? That can slow crawling down too. It may also be Scrapy being wise: if you crawl too intensively, you may get banned from the server.

For my hand-written C++ crawler I artificially set a limit of 1 request per second. Even that speed is enough for a single thread (1 req × 60 secs × 60 minutes × 24 hours = 86,400 req/day). If you're interested, you can email whalebot.helmsman {AT} gmail.com.
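As a sketch of that 1-request-per-second throttle (in Python rather than the original C++, which isn't shown here):

    import time
    import urllib.request

    REQUEST_INTERVAL = 1.0  # one request per second ~= 86,400 requests/day

    def polite_fetch(urls):
        """Fetch URLs sequentially, never exceeding 1 request per second."""
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            elapsed = time.monotonic() - start
            if elapsed < REQUEST_INTERVAL:
                time.sleep(REQUEST_INTERVAL - elapsed)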

answered Sep 21 '22 by whalebot.helmsman