
How to change selector for ItemLoader

Tags: python, scrapy

I'm trying to populate an item with ItemLoader, parsing data from multiple pages. But as far as I can see, I can't change the selector that I passed in when I initialized the ItemLoader. The documentation says this about the selector attribute:

selector

The Selector object to extract data from. It’s either the selector given in the constructor or one created from the response given in the constructor using the default_selector_class. This attribute is meant to be read-only.

Here's example code:

def parse(self, response):
    sel = Selector(response)
    videos = sel.xpath('//div[@class="video"]')

    for video in videos:
        loader = ItemLoader(VideoItem(), videos)
        loader.add_xpath('original_title', './/u/text()')
        loader.add_xpath('original_id', './/a[@class="hRotator"]/@href', re=r'movies/(\d+)/.+\.html')

        try:
            url = video.xpath('.//a[@class="hRotator"]/@href').extract()[0]
            request = Request(url,
                      callback=self.parse_video_page)
        except IndexError:
            pass

        request.meta['loader'] = loader
        yield request

    pages = sel.xpath('//div[@class="pager"]//a/@href').extract()
    for page in pages:
        url = urlparse.urljoin('http://www.mysite.com/', page)
        request = Request(url, callback=self.parse)
        yield request

def parse_video_page(self, response):
    loader = response.meta['loader']
    sel = Selector(response)

    loader.add_xpath('original_description', '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()')
    loader.add_xpath('duration', '//*[@id="video-info"]/div[2]/text()')
    loader.add_xpath('tags', '//*[@id="tags"]//a/text()')

    item = loader.load_item()

    return item

As it stands, I can't scrape any data from the second page.

Dmitrii Mikhailov asked Apr 07 '14 18:04



1 Answer

To answer your question directly: to change the selector of an ItemLoader, you can assign a new Selector object to the loader.selector attribute.

def parse_video_page(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.selector = sel

    loader.add_xpath(
        'original_description', 
        '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()'
    )
    # ...

However, this is not how loader objects are expected to be used (the documentation describes the selector attribute as read-only), so it is not supported: library updates can break this code or introduce subtle bugs. Passing the loader through the request meta is also a bad idea, because the loader object references the response object, and this can cause memory problems in some situations.
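The memory concern can be illustrated without Scrapy at all. The following is only a minimal sketch under stated assumptions: FakeResponse and FakeLoader are hypothetical stand-ins (not Scrapy classes), and run_callback is an illustrative helper. It shows that storing the loader in meta keeps the whole response alive, while storing a plain item dict does not.

```python
import gc
import weakref


class FakeResponse:
    """Hypothetical stand-in for a downloaded page (not Scrapy's Response)."""

    def __init__(self, body):
        self.body = body


class FakeLoader:
    """Hypothetical stand-in for a loader that references its response."""

    def __init__(self, response):
        self.response = response
        self.values = {}

    def load_item(self):
        # Return a plain copy of the collected values, detached from the loader.
        return dict(self.values)


def run_callback(pass_loader):
    """Simulate a callback; report whether the response outlives it."""
    response = FakeResponse("<html>...</html>")
    loader = FakeLoader(response)
    loader.values["original_title"] = "Example"
    probe = weakref.ref(response)

    # This is what would be attached to the follow-up request.
    if pass_loader:
        meta = {"loader": loader}
    else:
        meta = {"item": loader.load_item()}

    # The callback ends: local names disappear, but meta lives on.
    del response, loader
    gc.collect()
    return probe() is not None, meta
```

Calling run_callback(True) reports that the response is still reachable through meta, whereas run_callback(False) reports that it has been released. How much this matters in a real crawl depends on page sizes and on how many requests are queued at once.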

A more robust way to collect item fields across several callbacks is as follows (note the comments):

def parse(self, response):
    sel = Selector(response)
    videos = sel.xpath('//div[@class="video"]')

    for video in videos:
        try:
            url = video.xpath('.//a[@class="hRotator"]/@href').extract()[0]
        except IndexError:
            continue
        # Bind the loader to this single video node, not the whole list
        loader = ItemLoader(VideoItem(), selector=video)
        loader.add_xpath('original_title', './/u/text()')
        loader.add_xpath(
            'original_id', 
            './/a[@class="hRotator"]/@href', 
            re=r'movies/(\d+)/.+\.html'
        )
        item = loader.load_item()
        yield Request(
            urlparse.urljoin(response.url, url),
            callback=self.parse_video_page,
            # Note: item passed to the meta dict, not loader itself
            meta={'item': item}
        )

    pages = sel.xpath('//div[@class="pager"]//a/@href').extract()
    for page in pages:
        url = urlparse.urljoin('http://www.mysite.com/', page)
        yield Request(url, callback=self.parse)

def parse_video_page(self, response):
    item = response.meta['item']

    # Note: new loader object created, 
    # item from response.meta is passed to the constructor
    loader = ItemLoader(item, response=response)
    loader.add_xpath(
        'original_description', 
        '//*[@id="videoInfo"]//td[@class="desc"]/h2/text()'
    )
    loader.add_xpath(
        'duration', 
        '//*[@id="video-info"]/div[2]/text()'
    )
    loader.add_xpath('tags', '//*[@id="tags"]//a/text()')
    return loader.load_item()
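
The code above uses the Python 2 urlparse module; in Python 3 the same urljoin function lives in urllib.parse. As a rough sketch of what happens to each link, the snippet below resolves an href against the page URL and applies the same regular expression that is passed as re= to add_xpath. The sample URLs and the resolve_video_link helper are hypothetical, and re.findall is used here only as an approximation of how the loader applies the pattern.

```python
import re
from urllib.parse import urljoin

# Same pattern as the re= argument used for 'original_id'.
ID_PATTERN = re.compile(r'movies/(\d+)/.+\.html')


def resolve_video_link(page_url, href):
    """Return the absolute URL for href and any video ids found in it."""
    # urljoin handles relative, root-relative and already absolute hrefs.
    absolute_url = urljoin(page_url, href)
    # findall returns the captured group for every match in the string.
    ids = ID_PATTERN.findall(href)
    return absolute_url, ids


if __name__ == "__main__":
    print(resolve_video_link(
        "http://www.mysite.com/videos/page2.html",
        "/movies/12345/some-title.html",
    ))
```

Joining against response.url rather than a hard-coded site root means relative links are resolved against the page they were actually found on.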

Alex Chekunkov answered Sep 29 '22 22:09
