How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, so I end up with an empty item. I need something like item['start_url'] = request.url to store these URLs. I'm using the BaseSpider.
open() returns a file-like object with two additional methods, one of them being geturl(), which returns the final URL (after all redirects have been followed). It's not part of Scrapy, but it works.
Start with the documentation. Search it for "redirect" and you'll find: "followRedirect - follow HTTP 3xx responses as redirects (default: true). This property can also be implemented as function which gets response object as a single argument and should return true if redirects should continue or false otherwise."
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
The 'response' variable that's passed to parse() has the info you want. You shouldn't need to override anything.
E.g. (edited):
    def parse(self, response):
        # response.request is the Request that produced this response
        print("URL: " + response.request.url)
The request object is accessible from the response object, therefore you can do the following:
    def parse(self, response):
        # item is whatever Item instance your parse() already builds
        item['start_url'] = response.request.url
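One caveat for the redirect case in the question: with Scrapy's default RedirectMiddleware, response.request is the request that actually produced the response, so its url may already be the redirect target rather than the entry from start_urls. The middleware records the URLs a request passed through under the redirect_urls key of the request's meta. Here is a minimal sketch, assuming that middleware is enabled. original_url is a hypothetical helper name, and the SimpleNamespace objects are stand-ins for real Scrapy objects so the snippet can be tried without running a crawl.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL that was first requested, before any redirects.

    Assumes Scrapy's RedirectMiddleware is enabled (the default). It stores
    the chain of redirected URLs in request.meta['redirect_urls'].
    """
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        # The first entry is the URL the spider originally asked for.
        return redirect_urls[0]
    # No redirect happened, so the request URL is the original one.
    return response.request.url


# Stand-in objects (not real Scrapy classes), for illustration only.
fake_request = SimpleNamespace(
    url='http://example.com/',
    meta={'redirect_urls': ['http://example.com/old-page']},
)
fake_response = SimpleNamespace(url='http://example.com/', request=fake_request)

print(original_url(fake_response))
```

Inside a spider you would call it as item['start_url'] = original_url(response) in parse().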