 

Scrapy get request url in parse

How can I get the request url in Scrapy's parse() function? I have a lot of urls in start_urls, and some of them redirect my spider to the homepage, so as a result I get an empty item. So I need something like item['start_url'] = request.url to store these urls. I'm using the BaseSpider.

Asked by Goran, Nov 19 '13

People also ask

How do you get redirected URL in Scrapy?

urllib's urlopen() returns a file-like object with additional methods, one of them being geturl(), which returns the final URL (after all redirects have been followed). It's not part of Scrapy, but it works.

How do I find the response URL?

Start at the documentation. Search it for redirect and you'll find: followRedirect - follow HTTP 3xx responses as redirects (default: true). This property can also be implemented as function which gets response object as a single argument and should return true if redirects should continue or false otherwise.

What does Scrapy request return?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.


2 Answers

The 'response' variable that's passed to parse() has the info you want. You shouldn't need to override anything.

e.g.:

def parse(self, response):
    print("URL: " + response.request.url)
Answered by Jagu, Sep 24 '22


The request object is accessible from the response object, so you can do the following:

def parse(self, response):
    item = MyItem()  # MyItem is your own Item subclass, not shown in the answer
    item['start_url'] = response.request.url
    yield item
Answered by gusridd, Sep 23 '22
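One caveat worth knowing: in Scrapy, a redirect produces a new request, so response.request.url inside parse() is the final (post-redirect) URL, not necessarily the one from start_urls. A common pattern is to carry the original URL along in the request's meta dict, which redirect handling preserves. The relationship can be sketched with plain stand-in classes; note these are illustrative only, not the real scrapy.Request / scrapy.Response:

```python
# Illustration only: minimal stand-ins mirroring the attributes discussed
# above (response.request.url and request.meta). NOT real Scrapy classes.

class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, url, request):
        self.url = url          # URL the response body actually came from
        self.request = request  # the request that produced this response

def parse(response):
    # Equivalent of item['start_url'] = response.request.url in the answer above
    return {'start_url': response.request.url}

# Simulate a redirect: a new request is issued for the target URL, but meta
# (including anything you stored there yourself) is carried over.
start = Request('http://example.com/old-page',
                meta={'start_url': 'http://example.com/old-page'})
redirected = Request('http://example.com/', meta=start.meta)
response = Response(redirected.url, redirected)

print(parse(response))                      # final URL after the redirect
print(response.request.meta['start_url'])   # the original URL survives in meta
```

In real Scrapy code you would set meta={'start_url': url} when yielding the initial Request, then read response.request.meta['start_url'] in parse().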