I have 2 problems with my scraper:
1. It gets a lot of 302s after a while, despite 'COOKIES_ENABLED': False and a rotating proxy that should provide a different IP for each request. I worked around it by restarting the scraper after several 302s.
2. The scraper successfully crawls much more than it processes, and I can't do anything about it. In the example below I got 121 responses with status 200, but only 27 of them were processed.
Spider
from scrapy import Spider, Request
from scrapy.exceptions import CloseSpider


class MySpider(Spider):
    name = 'MySpider'

    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RETRY_TIMES': 1,
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_ERRORCOUNT': 3,
        'COOKIES_ENABLED': False,
    }

    # I need to manually control when the spider stops, otherwise it runs forever
    handle_httpstatus_list = [301, 302]

    added = 0  # counter of successfully processed responses

    def start_requests(self):
        # self.df is a pandas DataFrame of links, prepared elsewhere (not shown)
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse,
                priority=100
            )

    def close(self, reason):
        self.logger.info('TOTAL ADDED: %s' % self.added)

    def parse(self, r):
        if r.status == 302:
            # I need to manually control when the spider stops, otherwise it runs forever
            raise CloseSpider("")
        else:
            # do parsing stuff
            self.added += 1
            self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))
Output
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url1> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url2> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52451 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url3> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52450 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:37 [MySpider] INFO: TOTAL ADDED: 27
2018-08-08 12:24:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
...
...
'downloader/response_status_count/200': 121,
'downloader/response_status_count/302': 4,
It successfully crawls 3x or 4x more than it processes. How can I force Scrapy to process everything that was crawled? I can sacrifice speed, but I don't want to waste the 200 responses that were successfully crawled.
When your spider crawls a website, Scrapy automatically handles cookies for you, storing them and sending them in subsequent requests to the same site, unless you disable that behaviour.
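For reference, here is a minimal sketch of the two stock ways to turn cookie handling off: the spider-wide COOKIES_ENABLED setting (which the spider above already uses) and the per-request dont_merge_cookies meta key. The spider name and URL below are placeholders.

from scrapy import Spider, Request


class NoCookiesSpider(Spider):
    name = 'no_cookies'  # placeholder name

    # Spider-wide: disable the cookies middleware entirely
    custom_settings = {'COOKIES_ENABLED': False}

    def start_requests(self):
        # Per-request alternative: keep the middleware enabled, but do not
        # store or send cookies for this particular request
        yield Request(
            'https://mytarget.com/url1',  # placeholder URL
            meta={'dont_merge_cookies': True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Fetched %s without session cookies', response.url)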
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow; like any other Request callback, it must return an iterable of Request and/or item objects. You can save the returned data yourself with printing, logging, or regular file handling, but Scrapy's built-in way of saving and storing data is to yield items from the callback.
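For illustration only (the spider name, URL, and item fields are made up), a callback that yields items hands every scraped record to the item pipelines / feed exports, so nothing that reached parse() is simply discarded:

from scrapy import Spider


class ItemYieldingSpider(Spider):
    name = 'item_yielding'  # placeholder name
    start_urls = ['https://mytarget.com/url1']  # placeholder URL

    def parse(self, response):
        # Yield a plain dict item; the feed exports can persist it, e.g.
        #   scrapy crawl item_yielding -o items.jl
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),  # illustrative field
        }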
The scheduler may not have delivered all the 200 responses to the parse() method when you raise CloseSpider(). Log and ignore the 302s, and let the spider finish.
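A rough sketch of that change to the question's parse() method (self.added, self.df and the log message are the question's own attributes), assuming you simply record the redirect and move on:

def parse(self, r):
    if r.status in (301, 302):
        # Log the redirect and skip it; do not raise CloseSpider here,
        # otherwise 200 responses still waiting in the scheduler/downloader
        # are dropped before parse() ever sees them.
        self.logger.warning('Redirected (%s): %s', r.status, r.url)
        return
    # do parsing stuff
    self.added += 1
    self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))

If the run still needs an automatic cut-off, the built-in CLOSESPIDER_PAGECOUNT or CLOSESPIDER_TIMEOUT settings are a less abrupt way to bound it than raising CloseSpider from a callback.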