Below is my spider code:
import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'
Here I am trying to combine the href link with the base link, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone tell me why I am getting this error, and how to join the base URL with the href link and yield a request?
response.urljoin: the parse() method uses this to build a new URL and create a new request, which is later sent to its callback. parse_dir_contents(): this is a callback which will actually scrape the data of interest.
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function is invoked with the downloaded Response object as its first argument.
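To make that flow concrete, here is a minimal sketch (the spider name, URLs, and method names are placeholders, not from the question):

import scrapy

class CallbackSketchSpider(scrapy.Spider):
    name = "callback_sketch"  # placeholder name
    start_urls = ["http://www.example.com/listing"]  # placeholder URL

    def parse(self, response):
        # Build an absolute URL from a relative href, then schedule a
        # request; parse_page2 receives the downloaded Response.
        url = response.urljoin("/some_page.html")
        yield scrapy.Request(url, callback=self.parse_page2)

    def parse_page2(self, response):
        self.logger.info("Got response from %s", response.url)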
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
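For instance, a recursive crawl of the question's listing could look roughly like this (a sketch using the current scrapy.spiders API; the allow pattern is guessed from the URLs in the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlurbCrawlSpider(CrawlSpider):
    name = "blurb_crawl"
    allowed_domains = ["www.domain.com"]
    start_urls = ["http://www.domain.com/bookstore/new"]

    rules = (
        # Follow every detail link found on a page and pass each
        # downloaded response to parse_detail.
        Rule(LinkExtractor(allow=r"/bookstore/detail/"),
             callback="parse_detail", follow=True),
    )

    def parse_detail(self, response):
        self.logger.info("Scraping %s", response.url)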
Scrapy is asynchronous by default. The coroutine syntax introduced in Scrapy 2.0 simply allows for cleaner code when working with Twisted Deferreds, which are not needed in most use cases, as Scrapy makes their usage transparent whenever possible.
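On Scrapy 2.0 or later, a callback can be written as a coroutine; this sketch (with a placeholder spider and URL) behaves like a plain generator callback:

import scrapy

class AsyncSketchSpider(scrapy.Spider):
    name = "async_sketch"  # placeholder name
    start_urls = ["http://www.example.com"]  # placeholder URL

    async def parse(self, response):
        # An async generator callback: Scrapy drives it transparently,
        # so you only need await when calling other async APIs yourself.
        for href in response.css("a::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url}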
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes even a step further: here Scrapy works out the domain base for joining, and as you can see, you don't have to provide the obvious http://www.example.com for joining. This makes your code reusable in the future if you want to change the domain you are crawling.
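Applied to the spider from the question, the parse() method would look roughly like this (a sketch keeping the original selectors; if the hrefs are root-relative, response.urljoin(i) without the slice works too):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
    for i in urls:
        # Scrapy fills in the scheme and domain from response.url.
        yield Request(response.urljoin(i[1:]), callback=self.parse_url)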
The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest (more info in the docs). Quote from the docs:

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Also, you can pass an <a> element directly as an argument.
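Rewritten with response.follow (available since Scrapy 1.4), the question's parse() could look like this sketch; since the <a> selector itself is passed, the href never needs joining:

def parse(self, response):
    for a in response.css('div.bookListingBookTitle a'):
        # Passing the <a> element: Scrapy extracts its href and resolves
        # it against response.url before building the Request.
        yield response.follow(a, callback=self.parse_url)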
It is because you didn't add the scheme, e.g. http://, to your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
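With that one-line change, the question's loop runs as intended (a sketch inside the same spider class; urlparse is the Python 2 module the question already uses, in Python 3 it would be urllib.parse):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
    for i in urls:
        # response.url already includes the http:// scheme, so the joined
        # URL is absolute and Scrapy accepts it.
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)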