 

Combining base url with resultant href in scrapy

Tags: python, url, scrapy

Below is my spider code:

import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'

Here I am trying to combine the href link with the base link, but I am getting the following error:

exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do

Can anyone let me know why I am getting this error, and how to join the base URL with the href link and yield a request?

asked May 29 '12 by Shiva Krishna Bavandla

People also ask

How do I join a Scrapy URL?

response.urljoin − the parse() method uses this to build a new URL and provide a new request, which will be sent later to a callback. parse_dir_contents() − this is a callback which will actually scrape the data of interest.

What is callback in Scrapy?

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function is called with the downloaded Response object as its first argument; a sketch follows below.
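
For illustration, a minimal sketch of a callback chain (the spider name, URLs, and method names here are made up for the example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/page1"]

    def parse(self, response):
        # Request another page; parse_page2 runs once its response is downloaded
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)

    def parse_page2(self, response):
        # Called with the downloaded Response as its first argument
        self.logger.info("Visited %s", response.url)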

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for that.
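
A minimal sketch of that recursive pattern, assuming a modern Scrapy install (the spider name and URL pattern are made up):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksSpider(CrawlSpider):
    name = "books"
    start_urls = ["http://www.domain.com/bookstore/new"]

    # Follow every detail link found on crawled pages and parse it
    rules = (
        Rule(LinkExtractor(allow=r"/bookstore/detail/"), callback="parse_detail"),
    )

    def parse_detail(self, response):
        self.logger.info("Scraping %s", response.url)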

Is Scrapy asynchronous?

Scrapy is asynchronous by default. Coroutine syntax, introduced in Scrapy 2.0, simply provides a simpler way of working with Twisted Deferreds, which are not needed in most use cases, as Scrapy makes their usage transparent whenever possible.


3 Answers

An alternative solution, if you don't want to use urlparse:

response.urljoin(i[1:])

This solution goes a step further: Scrapy works out the base domain for the join itself, so, as you can see, you don't have to provide the obvious http://www.example.com part yourself.

This makes your code reusable in the future if you want to change the domain you are crawling.
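
Applied to the spider from the question, the loop would look something like this (a sketch using the newer scrapy.Spider and response.xpath APIs, which ship with the same Scrapy versions that provide response.urljoin; note that urljoin also handles the leading slash, so stripping it with i[1:] is not strictly necessary):

import scrapy

class Blurb2Spider(scrapy.Spider):
    name = "blurb2"
    start_urls = ["http://www.domain.com/bookstore/new"]

    def parse(self, response):
        for href in response.xpath('//div[@class="bookListingBookTitle"]/a/@href').extract():
            # The base (scheme + domain) is taken from the current response
            yield scrapy.Request(response.urljoin(href), callback=self.parse_url)

    def parse_url(self, response):
        self.logger.info("Detail page: %s", response.url)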

answered Oct 07 '22 by GHajba


The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest. See the Scrapy documentation for more info.

Quote from docs:

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.

Also, you can pass an <a> element directly as an argument, as shown in the sketch below.
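
A sketch of the same spider using response.follow (available since Scrapy 1.4), passing the <a> selectors directly so that neither extracting @href nor joining is needed:

import scrapy

class Blurb2Spider(scrapy.Spider):
    name = "blurb2"
    start_urls = ["http://www.domain.com/bookstore/new"]

    def parse(self, response):
        # response.follow resolves relative URLs against response.url,
        # and accepts <a> selectors directly (it uses their href attribute)
        for a in response.xpath('//div[@class="bookListingBookTitle"]/a'):
            yield response.follow(a, callback=self.parse_url)

    def parse_url(self, response):
        self.logger.info("Detail page: %s", response.url)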

answered Oct 07 '22 by Ali Kazemkhanloo


It is because you didn't add the scheme, e.g. http://, in your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
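
To see why the scheme matters, here is a quick illustration (Python 2, matching the question's urlparse module; in Python 3 the same function lives in urllib.parse):

import urlparse

# Without a scheme the base is treated as a plain relative path,
# so the result has no scheme either -- hence Scrapy's ValueError:
print urlparse.urljoin('www.domain.com/', 'bookstore/detail/3271993')
# -> www.domain.com/bookstore/detail/3271993

# With the scheme present, urljoin produces a fully qualified URL:
print urlparse.urljoin('http://www.domain.com/', 'bookstore/detail/3271993')
# -> http://www.domain.com/bookstore/detail/3271993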

answered Oct 07 '22 by Sjaak Trekhaak