I use the XMLFeedSpider in Scrapy to scrap a real estate website.
Each url request generated by my spider (via start_urls) return a page in XML with a bunch of ads and a link to the next page (search results is limited to 50 ads).
I was therefore wondering how i could add this additional page as new request in my spider ?
I've been searching through stackoverflow for a while but i just can't find a simple answer to my problem !
Below is the code i have in my spider. I have updated it with the parse_nodes() method mentioned by Paul but the next url is not picked up for some reasons.
Could i yield additional requests in the adapt_response method ?
from scrapy.spider import log
from scrapy.selector import XmlXPathSelector
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import RefItem, PicItem
from crawler.seloger_helper import urlbuilder
from scrapy.http import Request
class Seloger_spider_XML(XMLFeedSpider):
name = 'Seloger_spider_XML'
allowed_domains = ['seloger.com']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'annonce'
'''Spider Initialized with department as argument'''
def __init__(self, departement=None, *args, **kwargs):
super(Seloger_spider_XML, self).__init__(*args, **kwargs)
#self.start_urls = urlbuilder(departement) #helper function which generate start_urls
self.start_urls = ['http://ws.seloger.com/search.xml?cp=72&idtt=2&tri=d_dt_crea&SEARCHpg=1']
def parse_node(self, response, node):
items = []
item = RefItem()
item['ref'] = int(''.join(node.select('//annonce/idAnnonce/text()').extract()))
item['desc'] = ''.join(node.select('//annonce/descriptif/text()').extract()).encode('utf-8')
item['libelle'] = ''.join(node.select('//annonce/libelle/text()').extract()).encode('utf-8')
item['titre'] = ''.join(node.select('//annonce/titre/text()').extract()).encode('utf-8')
item['ville'] = ''.join(node.select('//annonce/ville/text()').extract()).encode('utf-8')
item['url'] =''.join(node.select('//annonce/permaLien/text()').extract()).encode('utf-8')
item['prix'] = ''.join(node.select('//annonce/prix/text()').extract())
item['prixunite'] = ''.join(node.select('//annonce/prixUnite/text()').extract())
item['datemaj'] = ''.join(node.select('//annonce/dtFraicheur/text()').extract())[:10]
item['datecrea'] = ''.join(node.select('//annonce/dtCreation/text()').extract())[:10]
item['lati'] = (''.join(node.select('//annonce/latitude/text()').extract()))
item['longi'] = (''.join(node.select('//annonce/longitude/text()').extract()))
item['surface'] = (''.join(node.select('//annonce/surface/text()').extract()))
item['surfaceunite'] = (''.join(node.select('//annonce/surfaceUnite/text()').extract()))
item['piece'] = (''.join(node.select('//annonce/nbPiece/text()').extract())).encode('utf-8')
item['ce'] = (''.join(node.select('//annonce/dbilanEmissionGES/text()').extract())).encode('utf-8')
items.append(item)
for photos in node.select('//annonce/photos'):
for link in photos.select('photo/thbUrl/text()').extract():
pic = PicItem()
pic['pic'] = link.encode('utf-8')
pic['refpic'] = item['ref']
items.append(pic)
return items
def parse_nodes(self, response, nodes):
for n in super(Seloger_spider_XML, self).parse_nodes(response, nodes):
yield n
# once you're done with item/nodes
# look for the next page link using XPath
# these lines are borrowed form
# https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/feed.py#L73
selector = XmlXPathSelector(response)
self._register_namespaces(selector)
for link_url in selector.select('//pageSuivante/text()').extract():
yield Request(link_url)
Thank you Gilles
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
if you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But scrapy also has a feature to automatically set download delays called AutoThrottle . It automatically sets delays based on load of both the Scrapy server and the website you are crawling.
You can override the parse_nodes()
method to hook in your "next page" URL extraction.
The example below is based on Scrapy docs XMLFeedExample:
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem
from scrapy.selector import XmlXPathSelector
from scrapy.http import Request
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
item = Item()
item['id'] = node.select('@id').extract()
item['name'] = node.select('name').extract()
item['description'] = node.select('description').extract()
return item
def parse_nodes(self, response, nodes):
# call built-in method that itself calls parse_node()
# and yield whatever it returns
for n in super(MySpider, self).parse_nodes(response, nodes):
yield n
# once you're done with item/nodes
# look for the next page link using XPath
# these lines are borrowed form
# https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/feed.py#L73
selector = XmlXPathSelector(response)
self._register_namespaces(selector)
for link_url in selector.select('//pageSuivante/text()').extract():
print "link_url", link_url
yield Request(link_url)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With