I am experiencing slow crawl speeds with Scrapy (around 1 page/sec). I'm crawling a major website from AWS servers, so I don't think it's a network issue. CPU utilization is nowhere near 100%, and if I start multiple Scrapy processes, the overall crawl speed is much faster.
Scrapy seems to crawl a bunch of pages, then hangs for several seconds, and then repeats.
I've tried playing with: CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_DOMAIN = 500
but this doesn't really seem to move the needle past about 20.
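For reference, here is a minimal settings.py sketch of the concurrency-related options in play; the values are illustrative, not recommendations:

    # settings.py -- illustrative values only
    CONCURRENT_REQUESTS = 500              # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 500   # per-domain cap; the lower of the two applies
    DOWNLOAD_DELAY = 0                     # no artificial per-request delay
    REACTOR_THREADPOOL_MAXSIZE = 20        # thread pool used for DNS lookups and other blocking calls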
One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately. If the pages you need follow a numeric pattern, splitting that range across several start URLs lets each range be requested simultaneously, so you get roughly a 4x speedup compared to a single entry such as start_urls = ["http://apps.webofknowledge.com/doc=1"].
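A sketch of that idea, assuming a sequential doc=N URL pattern; the chunk boundaries below are purely illustrative:

    # spider sketch -- the doc=N pattern and range boundaries are assumptions
    import scrapy

    class DocsSpider(scrapy.Spider):
        name = "docs"
        # four starting points let Scrapy work on four ranges of documents concurrently
        start_urls = [
            "http://apps.webofknowledge.com/doc=1",
            "http://apps.webofknowledge.com/doc=250001",
            "http://apps.webofknowledge.com/doc=500001",
            "http://apps.webofknowledge.com/doc=750001",
        ]

        def parse(self, response):
            # extract data here, then follow the next document in the current range
            yield {"url": response.url}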
That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which results in slower crawl rates.
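If that figure comes from Scrapy's built-in benchmark (an assumption here), you can reproduce it on your own machine with:

    scrapy bench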
If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY = 1 is the way to do it. But Scrapy also has a feature for setting download delays automatically, called AutoThrottle. It adjusts delays based on the load of both the Scrapy server and the website you are crawling.
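A minimal AutoThrottle setup might look like the following; the numbers are illustrative starting points, not tuned values:

    # settings.py -- minimal AutoThrottle sketch (values are illustrative)
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
    AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound if the site responds slowly
    AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average requests Scrapy should keep in flight
    AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response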
One of the biggest advantages of Scrapy is speed. Since it's asynchronous, Scrapy spiders don't have to wait to make requests one at a time; they can make many requests in parallel.
Are you sure you are allowed to crawl the destination site at high speed? Many sites implement request-rate thresholds and, after a while, start responding slowly or throttling you.