I am using a scrapy CrawlSpider and defined a twisted reactor to control my crawler. During testing I crawled a news site, collecting several GB of data. Since I am mostly interested in the newest stories, I would like to limit the number of requested pages, bytes, or seconds.

Is there a common way to define such a limit?
In scrapy there is the class scrapy.extensions.closespider.CloseSpider. You can define the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT.

The spider closes automatically when one of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider
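For reference, here is a minimal sketch of how these settings could be wired into a CrawlSpider via custom_settings. The spider name, domain, URLs, and the numeric limits are illustrative assumptions, not values from the question:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewsSpider(CrawlSpider):
    # Name, domain, and URLs are placeholders; substitute your own site.
    name = "news_example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/news"]

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    # Per-spider settings; these could equally go in settings.py.
    # The numeric limits are illustrative, not prescribed values.
    custom_settings = {
        "CLOSESPIDER_TIMEOUT": 3600,    # close after 3600 seconds of crawling
        "CLOSESPIDER_PAGECOUNT": 500,   # ...or after 500 downloaded responses
        "CLOSESPIDER_ITEMCOUNT": 200,   # ...or after 200 scraped items
        "CLOSESPIDER_ERRORCOUNT": 10,   # ...or after 10 errors
    }

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


if __name__ == "__main__":
    # CrawlerProcess starts and stops the twisted reactor for you; with a
    # manually managed reactor, CrawlerRunner accepts the same settings.
    process = CrawlerProcess()
    process.crawl(NewsSpider)
    process.start()
```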
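Note that the extension covers seconds (CLOSESPIDER_TIMEOUT) and pages (CLOSESPIDER_PAGECOUNT), but there is no CLOSESPIDER_* setting for a cumulative byte budget. One possible workaround, sketched below under the assumption that counting response body sizes is close enough for your use case, is to track the bytes yourself and raise scrapy.exceptions.CloseSpider, which stops the crawl gracefully just like the extension does:

```python
import scrapy
from scrapy.exceptions import CloseSpider


class ByteLimitedSpider(scrapy.Spider):
    # Hypothetical spider enforcing a manual 2 GB byte budget.
    name = "byte_limited"
    start_urls = ["https://example.com/news"]

    max_bytes = 2 * 1024 ** 3
    bytes_seen = 0

    def parse(self, response):
        # Count the raw body size of every response we process.
        self.bytes_seen += len(response.body)
        if self.bytes_seen > self.max_bytes:
            # Raising CloseSpider shuts the crawl down gracefully,
            # just like the CloseSpider extension would.
            raise CloseSpider("byte_limit_reached")
        yield {"url": response.url}
```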