My Scrapy code doesn't work and I can't figure out why. I want to scrape the Ikea website. I first wrote a CrawlSpider, but it was not specific enough to retrieve every link on the page, so I wrote a basic Spider that yields requests instead.
Here is my code:
    class IkeaSpider(scrapy.Spider) :
        name = "Ikea"
        allower_domains = ["http://www.ikea.com/"]
        start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]

        def parse_url(self, response):
            for sel in response.xpath('//div[@id="productsAzLeft"]'):
                base_url = 'http://www.ikea.com/'
                follow_url = sel.xpath('//span[@class="productsAzLink"]/@href').extract()
                complete_url = urlparse.urljoin(base_url, follow_url)
                request = Request(complete_url, callback = self.parse_page)
                yield request

        def parse_page(self, response):
And here is the error log:
    2016-01-04 22:06:31 [scrapy] ERROR: Spider error processing <GET http://www.ikea.com/fr/fr/catalog/productsaz/8/> (referer: None)
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 76, in parse
        raise NotImplementedError
    NotImplementedError
Your spider needs a parse method, which is the default callback for all initial requests. You can just rename the parse_url method to parse, and the NotImplementedError will go away.
    import scrapy

    class IkeaSpider(scrapy.Spider):
        name = "Ikea"
        # allowed_domains (not allower_domains) takes bare domains, not URLs
        allowed_domains = ["ikea.com"]
        start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]

        def parse(self, response):
            # Renamed from parse_url: parse is the default callback for start_urls
            for sel in response.xpath('//div[@id="productsAzLeft"]'):
                # extract() returns a list, so build one request per href
                for href in sel.xpath('.//span[@class="productsAzLink"]/@href').extract():
                    yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

        def parse_page(self, response):
            # Placeholder: the question does not show this method's body
            pass
You can also define a start_requests method and yield scrapy.Request objects manually with an explicit callback argument, just like you did here.
You have to implement the parse method if you only want to use start_urls in a spider. As the Scrapy documentation explains, parse is the default callback for the requests made from the URLs inside start_urls.
If you want to control the requests from the start, you can also use the start_requests method.