I have 2 problems with my scraper:
1. It gets a lot of 302s after a while, despite 'COOKIES_ENABLED': False and a rotating proxy that should provide a different IP for each request. I worked around it by restarting the scraper after several 302s.
2. The scraper successfully crawls much more than it processes, and I can't do anything about it. In the example below I got 121 responses with status 200, but only 27 of them were processed.
Spider
from scrapy import Spider, Request
from scrapy.exceptions import CloseSpider


class MySpider(Spider):
    name = 'MySpider'

    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RETRY_TIMES': 1,
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_ERRORCOUNT': 3,
        'COOKIES_ENABLED': False,
    }

    # I need to manually control when the spider stops, otherwise it runs forever
    handle_httpstatus_list = [301, 302]

    added = 0  # counter of successfully processed responses

    def start_requests(self):
        # self.df is a pandas DataFrame of links, prepared elsewhere (not shown)
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse,
                priority=100
            )

    def close(self, reason):
        self.logger.info('TOTAL ADDED: %s' % self.added)

    def parse(self, r):
        if r.status == 302:
            # I need to manually control when the spider stops, otherwise it runs forever
            raise CloseSpider("")
        else:
            # do parsing stuff
            self.added += 1
            self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))
Output
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url1> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url2> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52451 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url3> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52450 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:37 [MySpider] INFO: TOTAL ADDED: 27
2018-08-08 12:24:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
...
...
'downloader/response_status_count/200': 121,
'downloader/response_status_count/302': 4,
It successfully crawls 3x or 4x more than it processes. How can I force Scrapy to process everything that was crawled? I can sacrifice speed, but I don't want to waste the 200 responses that were successfully crawled.
When your spider crawls a website, Scrapy automatically handles cookies for you, storing them and sending them in subsequent requests to the same site, unless you disable that behaviour.
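For reference, here is a minimal sketch of the two stock ways to turn cookie handling off: the spider-wide COOKIES_ENABLED setting (which the spider above already uses) and the per-request dont_merge_cookies meta key. The spider name and URL below are placeholders.

from scrapy import Spider, Request


class NoCookiesSpider(Spider):
    name = 'no_cookies'  # placeholder name

    # Spider-wide: disable the cookies middleware entirely
    custom_settings = {'COOKIES_ENABLED': False}

    def start_requests(self):
        # Per-request alternative: keep the middleware enabled, but do not
        # store or send cookies for this particular request
        yield Request(
            'https://mytarget.com/url1',  # placeholder URL
            meta={'dont_merge_cookies': True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Fetched %s without session cookies', response.url)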
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow; like any other Request callback, it must return an iterable of Request and/or item objects. You can save the returned data yourself with printing, logging, or regular file handling, but Scrapy's built-in way of saving and storing data is to yield items from the callback.
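For illustration only (the spider name, URL, and item fields are made up), a callback that yields items hands every scraped record to the item pipelines / feed exports, so nothing that reached parse() is simply discarded:

from scrapy import Spider


class ItemYieldingSpider(Spider):
    name = 'item_yielding'  # placeholder name
    start_urls = ['https://mytarget.com/url1']  # placeholder URL

    def parse(self, response):
        # Yield a plain dict item; the feed exports can persist it, e.g.
        #   scrapy crawl item_yielding -o items.jl
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),  # illustrative field
        }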
The scheduler may not have delivered all the 200 responses to the parse() method when you raise CloseSpider(). Log and ignore the 302s, and let the spider finish.
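A rough sketch of that change to the question's parse() method (self.added, self.df and the log message are the question's own attributes), assuming you simply record the redirect and move on:

def parse(self, r):
    if r.status in (301, 302):
        # Log the redirect and skip it; do not raise CloseSpider here,
        # otherwise 200 responses still waiting in the scheduler/downloader
        # are dropped before parse() ever sees them.
        self.logger.warning('Redirected (%s): %s', r.status, r.url)
        return
    # do parsing stuff
    self.added += 1
    self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))

If the run still needs an automatic cut-off, the built-in CLOSESPIDER_PAGECOUNT or CLOSESPIDER_TIMEOUT settings are a less abrupt way to bound it than raising CloseSpider from a callback.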