 

Combining base url with resultant href in scrapy

Tags: python, url, scrapy

Below is my spider code:

import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'

Here I am trying to combine the href link with the base link, but I am getting the following error:

exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do

Can anyone let me know why I am getting this error, and how to join the base URL with the href link and yield a request?

asked May 29 '12 by Shiva Krishna Bavandla

People also ask

How do I join a Scrapy URL?

response.urljoin − the parse() method uses this to build a new URL and provide a new request, which will be sent later to a callback. parse_dir_contents() − this is a callback which will actually scrape the data of interest.

What is callback in Scrapy?

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function is called with the downloaded Response object as its first argument; a sketch follows below.
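
For illustration, a minimal sketch of a callback chain (the spider name, URLs, and method names here are made up for the example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/page1"]

    def parse(self, response):
        # Request another page; parse_page2 runs once its response is downloaded
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)

    def parse_page2(self, response):
        # Called with the downloaded Response as its first argument
        self.logger.info("Visited %s", response.url)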

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for that.
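
A minimal sketch of that recursive pattern, assuming a modern Scrapy install (the spider name and URL pattern are made up):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksSpider(CrawlSpider):
    name = "books"
    start_urls = ["http://www.domain.com/bookstore/new"]

    # Follow every detail link found on crawled pages and parse it
    rules = (
        Rule(LinkExtractor(allow=r"/bookstore/detail/"), callback="parse_detail"),
    )

    def parse_detail(self, response):
        self.logger.info("Scraping %s", response.url)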

Is Scrapy asynchronous?

Scrapy is asynchronous by default. Coroutine syntax, introduced in Scrapy 2.0, simply provides a simpler way of working with Twisted Deferreds, which are not needed in most use cases, as Scrapy makes their usage transparent whenever possible.


3 Answers

An alternative solution, if you don't want to use urlparse:

response.urljoin(i[1:])

This solution goes a step further: Scrapy works out the base domain for the join itself, so, as you can see, you don't have to provide the obvious http://www.example.com part yourself.

This makes your code reusable in the future if you want to change the domain you are crawling.
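
Applied to the spider from the question, the loop would look something like this (a sketch using the newer scrapy.Spider and response.xpath APIs, which ship with the same Scrapy versions that provide response.urljoin; note that urljoin also handles the leading slash, so stripping it with i[1:] is not strictly necessary):

import scrapy

class Blurb2Spider(scrapy.Spider):
    name = "blurb2"
    start_urls = ["http://www.domain.com/bookstore/new"]

    def parse(self, response):
        for href in response.xpath('//div[@class="bookListingBookTitle"]/a/@href').extract():
            # The base (scheme + domain) is taken from the current response
            yield scrapy.Request(response.urljoin(href), callback=self.parse_url)

    def parse_url(self, response):
        self.logger.info("Detail page: %s", response.url)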

answered Oct 07 '22 by GHajba


The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest. See the Scrapy documentation for more info.

Quote from docs:

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.

Also, you can pass an <a> element directly as an argument, as shown in the sketch below.
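
A sketch of the same spider using response.follow (available since Scrapy 1.4), passing the <a> selectors directly so that neither extracting @href nor joining is needed:

import scrapy

class Blurb2Spider(scrapy.Spider):
    name = "blurb2"
    start_urls = ["http://www.domain.com/bookstore/new"]

    def parse(self, response):
        # response.follow resolves relative URLs against response.url,
        # and accepts <a> selectors directly (it uses their href attribute)
        for a in response.xpath('//div[@class="bookListingBookTitle"]/a'):
            yield response.follow(a, callback=self.parse_url)

    def parse_url(self, response):
        self.logger.info("Detail page: %s", response.url)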

answered Oct 07 '22 by Ali Kazemkhanloo


It is because you didn't add the scheme, e.g. http://, in your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
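
To see why the scheme matters, here is a quick illustration (Python 2, matching the question's urlparse module; in Python 3 the same function lives in urllib.parse):

import urlparse

# Without a scheme the base is treated as a plain relative path,
# so the result has no scheme either -- hence Scrapy's ValueError:
print urlparse.urljoin('www.domain.com/', 'bookstore/detail/3271993')
# -> www.domain.com/bookstore/detail/3271993

# With the scheme present, urljoin produces a fully qualified URL:
print urlparse.urljoin('http://www.domain.com/', 'bookstore/detail/3271993')
# -> http://www.domain.com/bookstore/detail/3271993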

answered Oct 07 '22 by Sjaak Trekhaak