Scrapy: Sending information to prior function

Tags:

python

scrapy

I am using Scrapy 1.1 to scrape a website. The site requires periodic relogin; I can tell when this is needed because a 302 redirect occurs whenever login is required. Following http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:

request.meta['redirect_urls']
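
This is not the OP's actual subclass; as a rough sketch of the general idea (the 302/Location handling shown here is an assumption about the intent, and note that the stock RedirectMiddleware already records followed redirects under this meta key), it might look something like:

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):

    def process_response(self, request, response, spider):
        # Capture the Location header of the redirect before delegating
        # to the stock middleware, so callbacks can inspect it later via
        # response.meta['redirect_urls'].
        if response.status in (301, 302) and 'Location' in response.headers:
            request.meta.setdefault('redirect_urls', []).append(
                response.headers['Location'])
        return super(CustomRedirectMiddleware, self).process_response(
            request, response, spider)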

My problem is that after logging in, I have set up a function that loops through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like this:

def test1(self, response):

    ......
    for row in empties:  # 100 records
        d = object_as_dict(row)

        # AA

        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):

    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])

        # BB

    d = response.meta['d']

So as you can see, I get 'notified' of the need to relogin in parse_lookup at BB, but I need to feed this information back to cancel the loop that creates requests in test1 (at AA). How can I make the information in parse_lookup available in the prior callback function?

asked Jul 21 '17 by user1592380

2 Answers

Why not use a DownloaderMiddleware?

You could write a DownloaderMiddleware like so:

Edit: I have edited the original code to address a second problem the OP had in the comments.

from scrapy.http import Request

class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url", 
                           callback=spider.login, 
                           meta={"original_url": original_url})
        return response

So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...

As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by logging back in several times in a row, as multiple requests might be redirected at the same time.
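
The spider.login callback referenced in the middleware is not shown in the answer; a rough sketch of what it could look like (the form field names, the credential attributes, and the after_login helper are assumptions for illustration, not part of the answer):

from scrapy.http import FormRequest, Request

def login(self, response):
    # Form field names and credential attributes are assumptions.
    return FormRequest.from_response(
        response,
        formdata={'username': self.username, 'password': self.password},
        callback=self.after_login,
        meta={'original_url': response.meta['original_url']},
    )

def after_login(self, response):
    # Once logged back in, retry the request that was originally redirected.
    return Request(
        url=response.meta['original_url'],
        headers=self.headers,
        callback=self.parse_lookup,
        dont_filter=True,
    )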

Remember to add the middleware to your settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
answered Oct 08 '22 by Henrique Coura

You can't achieve what you want because Scrapy uses asynchronous processing.

In theory, you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write your own spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.

But I think a better and simpler approach would be to do the same thing as soon as you detect the need to log in, i.e. in parse_lookup itself (see the sketch below). I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
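
With CONCURRENT_REQUESTS_PER_DOMAIN = 1 in settings.py, a rough sketch of this relogin-in-parse_lookup approach might look like the following (self.login_url, the form field names, and the credential attributes are assumptions, not part of the answer):

from scrapy.http import FormRequest, Request

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # We were bounced to the login page: log back in, then retry the
        # URL that triggered the redirect (the first one in the chain).
        yield FormRequest(
            url=self.login_url,
            formdata={'username': self.username, 'password': self.password},
            callback=self.retry_after_login,
            meta={'retry_url': response.meta['redirect_urls'][0]},
            dont_filter=True,
        )
        return
    d = response.meta['d']
    # ... normal parsing continues here ...

def retry_after_login(self, response):
    yield Request(
        url=response.meta['retry_url'],
        headers=self.headers,
        callback=self.parse_lookup,
        dont_filter=True,
    )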

answered Oct 08 '22 by Tomáš Linhart