I am using a Scrapy CrawlSpider and have defined a Twisted reactor to control my crawler. During testing I crawled a news site and collected several GBs of data. Since I am mostly interested in the newest stories, I am looking for a way to limit the number of requested pages, the number of bytes downloaded, or the elapsed time in seconds.

Is there a common way to define such a limit?
Scrapy provides the CloseSpider extension (scrapy.extensions.closespider.CloseSpider) for exactly this.
You can set the variables CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT in your settings.
The spider closes automatically as soon as any of these conditions is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider
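As a minimal sketch, the settings can be applied per spider via the `custom_settings` class attribute instead of the project-wide settings.py. The spider name, start URL, and limit values below are illustrative placeholders:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NewsSpider(CrawlSpider):
    name = "news"                         # hypothetical spider name
    start_urls = ["http://example.com"]   # placeholder start URL
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    # Per-spider overrides; the same keys work in settings.py project-wide.
    # The spider closes when the first of these limits is reached.
    custom_settings = {
        "CLOSESPIDER_TIMEOUT": 3600,      # stop after 1 hour of crawling
        "CLOSESPIDER_PAGECOUNT": 1000,    # ...or after 1000 responses
        "CLOSESPIDER_ITEMCOUNT": 500,     # ...or after 500 scraped items
        "CLOSESPIDER_ERRORCOUNT": 10,     # ...or after 10 errors
    }

    def parse_item(self, response):
        # Placeholder extraction: yield the URL and page title
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Note that these limits count pages, items, seconds, and errors; the extension has no byte-based limit, so a size cap would have to be approximated via CLOSESPIDER_PAGECOUNT.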