I am using Scrapy 1.1 to scrape a website. The site requires periodic relogin; I can tell when this is needed because a 302 redirection occurs whenever login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/ , I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ......
    for row in empties: # 100 records
        d = object_as_dict(row)
        AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup, meta={'d': d}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to relogin in parse_lookup at BB, but need to feed this information back to cancel the request-generating loop in test1 (at AA). How can I make the information in parse_lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request

class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it reaches parse_lookup, relogin/fix what is wrong, and yield new requests...
As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by "relogging in" several times in a row, as multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
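To make the middleware's branching concrete, here is a minimal stand-alone sketch of the same decision logic using plain stand-in objects instead of Scrapy's classes (FakeRequest and FakeResponse are invented for illustration only; in a real project you would use scrapy.http.Request and the middleware above):

```python
class FakeRequest:
    """Stand-in for scrapy.http.Request: records url, callback and meta."""
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for a Scrapy response carrying a meta dict."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def process_response(request, response, spider_login):
    """Same decision as the middleware: if the response went through a
    redirect, replace it with a login request that remembers the
    original URL; otherwise pass the response through unchanged."""
    if 'redirect_urls' in response.meta:
        original_url = response.meta['redirect_urls'][0]
        return FakeRequest(url='login_url',
                           callback=spider_login,
                           meta={'original_url': original_url})
    return response

# A redirected response is swapped for a login request carrying the
# original URL, so the spider can come back to it after logging in.
redirected = FakeResponse('http://example.com/page15',
                          meta={'redirect_urls': ['http://example.com/page15']})
out = process_response(None, redirected, spider_login=lambda r: r)
print(type(out).__name__, out.meta['original_url'])

# A normal response passes through untouched.
normal = FakeResponse('http://example.com/page16')
print(process_response(None, normal, spider_login=None) is normal)
```

The key design point is that the middleware never lets the redirected response reach parse_lookup at all, so the spider's callbacks only ever see logged-in pages.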
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write a spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
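The control flow of that exception-based approach can be sketched in plain Python, without Scrapy, to show the idea (ReloginNeeded, the meta dicts, and the returned "plan" strings are all invented for illustration; a real spider middleware would return Request objects from process_spider_exception instead):

```python
class ReloginNeeded(Exception):
    """Raised by a parse callback when a login redirect is detected.
    Carries the URL of the request that should be retried."""
    def __init__(self, failed_url):
        super().__init__(failed_url)
        self.failed_url = failed_url

def parse_lookup(meta):
    # In the real spider this inspects response.meta['redirect_urls'].
    if 'redirect_urls' in meta:
        raise ReloginNeeded(meta['redirect_urls'][0])
    return 'parsed'

def handle_with_middleware(meta):
    """Rough analogue of process_spider_exception: convert the
    exception into 'log back in, then retry the failed request'."""
    try:
        return [parse_lookup(meta)]
    except ReloginNeeded as exc:
        return ['login_request', 'retry:' + exc.failed_url]

print(handle_with_middleware({}))
print(handle_with_middleware({'redirect_urls': ['http://example.com/p15']}))
```

The exception is just a signal channel: the callback that notices the stale session does not fix it itself, it hands the problem to one central place that knows how to relogin and reschedule.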
But I think a better and simpler approach would be to do the same once you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you would always log back in whenever you need to.
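If you try that throttling route, the change is a one-line settings fragment (in your project's settings.py; the rest of the settings are assumed unchanged):

```python
# settings.py
# Process at most one request per domain at a time, so a relogin
# completes before the next page request is sent to the site.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Note this trades crawl speed for correctness: the whole crawl of that domain becomes sequential.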