Scrapy Crawling Speed is Slow (60 pages / min)

I am experiencing slow crawl speeds with Scrapy (around 1 page / sec). I'm crawling a major website from AWS servers, so I don't think it's a network issue. CPU utilization is nowhere near 100%, and if I start multiple Scrapy processes the crawl speed is much faster.

Scrapy seems to crawl a bunch of pages, then hangs for several seconds, and then repeats.

I've tried setting CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_DOMAIN = 500, but this doesn't really seem to move the needle past about 20.
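For reference, here is a minimal sketch of how the settings described above might look in a project's settings.py. The two concurrency values come from the question; the remaining lines are assumptions added for illustration. They are real Scrapy settings shown at their default values, which can also cap throughput if changed elsewhere in the project.

```python
# settings.py (sketch; only the first two values come from the question)
CONCURRENT_REQUESTS = 500             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 500  # per-domain cap

# Assumptions for illustration, shown at Scrapy's default values:
CONCURRENT_REQUESTS_PER_IP = 0   # if non-zero, the per-domain cap is ignored
DOWNLOAD_DELAY = 0               # any non-zero delay throttles each download slot
AUTOTHROTTLE_ENABLED = False     # AutoThrottle adjusts delays when enabled
```

If any of the last three settings is overridden in the project or in the spider's custom_settings, raising the concurrency limits alone may have little visible effect.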

somewire asked Nov 22 '12 02:11


People also ask

What is the fastest way to scrape with Scrapy?

One workaround to speed up your Scrapy crawl is to configure start_urls appropriately. This way, requests will start from 250 to 251,249 and from 750 to 751,749 simultaneously, so the crawl will be about 4 times faster than with start_urls = ["http://apps.webofknowledge.com/doc=1"].
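The idea of seeding a crawl at several offsets can be sketched as follows. This is only an illustration: the URL pattern is taken from the text above, while the helper name seed_start_urls, the ID range of 1 to 1,000,000, and the split into 4 parts are assumptions, not values from the original answer.

```python
# URL pattern taken from the text above; "{}" stands for the document number.
BASE_URL = "http://apps.webofknowledge.com/doc={}"


def seed_start_urls(first_id, last_id, parts):
    """Return one start URL per chunk of the inclusive ID range (hypothetical helper)."""
    total = last_id - first_id + 1
    chunk = -(-total // parts)  # ceiling division so every ID falls in some chunk
    return [BASE_URL.format(start) for start in range(first_id, last_id + 1, chunk)]


# Assumed range and number of parts, for illustration only.
start_urls = seed_start_urls(1, 1_000_000, 4)
```

Each seed URL gives the scheduler an independent chain of pages to follow, so requests from different chains can be in flight at the same time.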

How fast is Scrapy?

That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended to follow links; any custom spider you write will probably do more work, which results in slower crawl rates.

How do you slow down Scrapy?

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But Scrapy also has a feature called AutoThrottle that sets download delays automatically, based on the load of both the Scrapy server and the website you are crawling.
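A minimal sketch of both approaches, written as dictionaries that could be assigned to a spider's custom_settings. The dictionary names are hypothetical, and the AutoThrottle values shown are Scrapy's documented defaults rather than tuned recommendations. Note that Scrapy randomizes the delay between 0.5x and 1.5x of DOWNLOAD_DELAY by default, so an exact one-second delay also needs RANDOMIZE_DOWNLOAD_DELAY turned off.

```python
# Option 1: a fixed delay between requests to the same site.
FIXED_DELAY_SETTINGS = {
    "DOWNLOAD_DELAY": 1,                # seconds between consecutive requests
    "RANDOMIZE_DOWNLOAD_DELAY": False,  # default True would vary it from 0.5x to 1.5x
}

# Option 2: let AutoThrottle adapt the delay to observed latency.
AUTOTHROTTLE_SETTINGS = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,         # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 60.0,          # upper bound under high latency
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # average parallel requests per site
}
```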

Why is Scrapy faster?

One of the biggest advantages of Scrapy is speed. Since it's asynchronous, Scrapy spiders don't have to make requests one at a time; they can make requests in parallel.
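To illustrate why overlapping requests helps, here is a small simulation. It uses Python's asyncio rather than Twisted (which Scrapy is built on), and fake_fetch is a hypothetical stand-in that only sleeps instead of touching the network, so treat it as a sketch of the principle and not of Scrapy's internals.

```python
import asyncio
import time


async def fake_fetch(page, latency=0.05):
    """Pretend to download a page; only sleeps (hypothetical stand-in)."""
    await asyncio.sleep(latency)
    return page


async def crawl_sequential(pages):
    # Each request waits for the previous one to finish.
    return [await fake_fetch(p) for p in pages]


async def crawl_concurrent(pages):
    # All requests are in flight at once; results keep input order.
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


def timed(coro_fn, pages):
    """Run a crawl function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(pages))
    return result, time.perf_counter() - start


pages = list(range(10))
seq_result, seq_time = timed(crawl_sequential, pages)
con_result, con_time = timed(crawl_concurrent, pages)
```

With ten simulated pages, the sequential run takes roughly the sum of the latencies, while the concurrent run takes roughly the latency of one request.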


1 Answer

Are you sure you are allowed to crawl the destination site at high speed? Many sites enforce a download rate threshold and, after a while, start responding slowly.

gvtech answered Sep 18 '22 14:09