How to add new requests for my Scrapy Spider during crawling

Tags:

python

scrapy

I use the XMLFeedSpider in Scrapy to scrape a real estate website.

Each URL request generated by my spider (via start_urls) returns an XML page with a bunch of ads and a link to the next page (search results are limited to 50 ads).

I was therefore wondering how I could add this additional page as a new request in my spider?

I've been searching through Stack Overflow for a while but I just can't find a simple answer to my problem!

Below is the code I have in my spider. I have updated it with the parse_nodes() method mentioned by Paul, but the next URL is not picked up for some reason.

Could I yield additional requests in the adapt_response() method?

from scrapy.spider import log
from scrapy.selector import XmlXPathSelector
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import RefItem, PicItem
from crawler.seloger_helper import urlbuilder
from scrapy.http import Request

class Seloger_spider_XML(XMLFeedSpider):
    name = 'Seloger_spider_XML'
    allowed_domains = ['seloger.com']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'annonce'  

    '''Spider Initialized with department as argument'''
    def __init__(self, departement=None, *args, **kwargs):
        super(Seloger_spider_XML, self).__init__(*args, **kwargs)
        #self.start_urls = urlbuilder(departement) #helper function which generate start_urls
        self.start_urls = ['http://ws.seloger.com/search.xml?cp=72&idtt=2&tri=d_dt_crea&SEARCHpg=1']

    def parse_node(self, response, node):

        items = []
        item = RefItem()

        item['ref'] = int(''.join(node.select('//annonce/idAnnonce/text()').extract()))
        item['desc'] = ''.join(node.select('//annonce/descriptif/text()').extract()).encode('utf-8')
        item['libelle'] = ''.join(node.select('//annonce/libelle/text()').extract()).encode('utf-8')
        item['titre'] = ''.join(node.select('//annonce/titre/text()').extract()).encode('utf-8')
        item['ville'] = ''.join(node.select('//annonce/ville/text()').extract()).encode('utf-8')
        item['url'] = ''.join(node.select('//annonce/permaLien/text()').extract()).encode('utf-8')
        item['prix'] = ''.join(node.select('//annonce/prix/text()').extract())
        item['prixunite'] = ''.join(node.select('//annonce/prixUnite/text()').extract())
        item['datemaj'] = ''.join(node.select('//annonce/dtFraicheur/text()').extract())[:10]
        item['datecrea'] = ''.join(node.select('//annonce/dtCreation/text()').extract())[:10]
        item['lati'] = ''.join(node.select('//annonce/latitude/text()').extract())
        item['longi'] = ''.join(node.select('//annonce/longitude/text()').extract())
        item['surface'] = ''.join(node.select('//annonce/surface/text()').extract())
        item['surfaceunite'] = ''.join(node.select('//annonce/surfaceUnite/text()').extract())
        item['piece'] = ''.join(node.select('//annonce/nbPiece/text()').extract()).encode('utf-8')
        item['ce'] = ''.join(node.select('//annonce/dbilanEmissionGES/text()').extract()).encode('utf-8')

        items.append(item)

        for photos in node.select('//annonce/photos'):
            for link in photos.select('photo/thbUrl/text()').extract():
                pic = PicItem()
                pic['pic'] = link.encode('utf-8')
                pic['refpic'] = item['ref']
                items.append(pic)

        return items

    def parse_nodes(self, response, nodes):
        for n in super(Seloger_spider_XML, self).parse_nodes(response, nodes):
            yield n
        # once you're done with item/nodes
        # look for the next page link using XPath
        # these lines are borrowed from
        # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/feed.py#L73
        selector = XmlXPathSelector(response)
        self._register_namespaces(selector)
        for link_url in selector.select('//pageSuivante/text()').extract():
            yield Request(link_url) 

Thank you Gilles

asked Oct 06 '13 by Gilles


People also ask

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
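As a minimal sketch (the URL and the parse_detail callback name are illustrative, not taken from the question above):

import scrapy

class MinimalRequestSpider(scrapy.Spider):
    name = 'minimal_request'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow each link found on the page; Scrapy invokes parse_detail()
        # with the Response that comes back for every Request yielded here
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url}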

How do you use Scrapy requests?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
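For instance, a small sketch of that round trip, where data attached to the Request via meta comes back with the Response (URLs and field names are illustrative):

import scrapy

class ResponseFlowSpider(scrapy.Spider):
    name = 'response_flow'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # the Downloader has executed the Request and handed back a Response
        self.logger.info('Got %s with status %d', response.url, response.status)
        yield scrapy.Request(
            response.urljoin('/page/2/'),
            callback=self.parse_page,
            meta={'referer_page': response.url},  # travels along with the Response
        )

    def parse_page(self, response):
        yield {'url': response.url, 'came_from': response.meta['referer_page']}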

How do you run multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
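A minimal sketch, assuming two spider classes (SpiderOne, SpiderTwo) already exist in a project named myproject:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical spiders defined elsewhere in the project
from myproject.spiders import SpiderOne, SpiderTwo

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # blocks until both spiders have finished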

How do you do a delayed request in Scrapy?

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But Scrapy also has a feature to automatically set download delays, called AutoThrottle. It automatically sets delays based on the load of both the Scrapy server and the website you are crawling.
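In settings.py that looks roughly like this (the AutoThrottle values shown are only plausible examples, not recommendations):

# fixed one-second pause between consecutive requests
DOWNLOAD_DELAY = 1

# ...or let AutoThrottle adjust the delay dynamically instead
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10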


1 Answer

You can override the parse_nodes() method to hook in your "next page" URL extraction.

The example below is based on the XMLFeedSpider example in the Scrapy docs:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem
from scrapy.selector import XmlXPathSelector
from scrapy.http import Request

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item

    def parse_nodes(self, response, nodes):
        # call built-in method that itself calls parse_node()
        # and yield whatever it returns
        for n in super(MySpider, self).parse_nodes(response, nodes):
            yield n

        # once you're done with item/nodes
        # look for the next page link using XPath
        # these lines are borrowed from
        # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/feed.py#L73
        selector = XmlXPathSelector(response)
        self._register_namespaces(selector)
        for link_url in selector.select('//pageSuivante/text()').extract():
            print "link_url", link_url
            yield Request(link_url)
answered Nov 15 '22 by paul trmbrth
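Note: the accepted answer relies on the old scrapy.contrib modules, XmlXPathSelector and scrapy.log, which were removed in later Scrapy releases. On a current version the same parse_nodes() override can be written with response.xpath(); below is a minimal sketch, reusing the feed assumptions from the question (itertag 'annonce', a pageSuivante element holding the next-page URL):

from scrapy import Request
from scrapy.spiders import XMLFeedSpider

class SelogerSpiderXML(XMLFeedSpider):
    name = 'seloger_spider_xml'
    allowed_domains = ['seloger.com']
    start_urls = ['http://ws.seloger.com/search.xml?cp=72&idtt=2&tri=d_dt_crea&SEARCHpg=1']
    itertag = 'annonce'

    def parse_node(self, response, node):
        # build and return items here, as in the question's parse_node()
        return []

    def parse_nodes(self, response, nodes):
        # yield whatever the built-in node handling produces
        for output in super(SelogerSpiderXML, self).parse_nodes(response, nodes):
            yield output
        # response.xpath() replaces the deprecated XmlXPathSelector
        for link_url in response.xpath('//pageSuivante/text()').extract():
            yield Request(link_url)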