
Scrapy - parsing all sub-pages of a given domain

I would like to parse kickstarter.com projects using scrapy, but can't figure out how to make the spider search projects that I don't explicitly specify under start_urls. I have the first part of the scrapy code figured out (I can extract the necessary information from one website), I just can't get it to do this for all projects under the domain kickstarter.com/projects.

From what I've read, I believe that parsing is possible (1) using links on the starting page (kickstarter.com/projects), (2) using links from one project page to jump to another project, and (3) using a site map (which I don't think kickstarter.com has) to locate webpages to parse.

I've spent hours trying each of these methods, but I am getting nowhere.

I've used the scrapy tutorial code and built on it.

Here is the part so far that works:

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider   
from scrapy.selector import HtmlXPathSelector  

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']    
    start_urls = ["http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic"]

    def parse(self, response):
        x = HtmlXPathSelector(response)

        item = kickstarteritem()
        item['url'] = response.url
        item['name'] = x.select("//div[@class='NS-project_-running_board']/h2[@id='title']/a/text()").extract()
        item['launched'] = x.select("//li[@class='posted']/text()").extract()
        item['ended'] = x.select("//li[@class='ends']/text()").extract()
        item['backers'] = x.select("//span[@class='count']/data[@data-format='number']/@data-value").extract()
        item['pledge'] = x.select("//div[@class='num']/@data-pledged").extract()
        item['goal'] = x.select("//div[@class='num']/@data-goal").extract()
        return item
Boiler_Maker asked Mar 14 '13 02:03


1 Answer

Since you're subclassing CrawlSpider, do not override parse. CrawlSpider implements its link-following logic in parse, so overriding it disables the crawling behaviour you need.
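To make the point about parse concrete, here is a toy model in plain Python. It is not Scrapy's source code; ToyCrawlSpider, GoodSpider and BrokenSpider are hypothetical names, and the rule format is simplified to (predicate, callback name) pairs. It only sketches why replacing the base-class method silently turns the rules off.

```python
# Toy model only: this is NOT Scrapy's implementation, just the shape of the problem.
class ToyCrawlSpider:
    rules = ()  # pairs of (url_predicate, callback_name); simplified, hypothetical format

    def parse(self, url, links):
        """Base-class entry point: applies the rules to every link found on the page."""
        results = []
        for link in links:
            for matches, callback_name in self.rules:
                if matches(link):
                    results.append(getattr(self, callback_name)(link))
                    break  # the first matching rule handles the link
        return results


class GoodSpider(ToyCrawlSpider):
    rules = ((lambda u: "/projects/" in u, "parse_item"),)

    def parse_item(self, url):
        return {"url": url}


class BrokenSpider(GoodSpider):
    def parse(self, url, links):
        # Overriding parse replaces the rule handling, so no link is ever dispatched.
        return [{"url": url}]


links = ["/projects/a/b", "/about"]
print(GoodSpider().parse("/discover", links))    # rules applied to the links
print(BrokenSpider().parse("/discover", links))  # only the start page is handled
```

In the toy model the fix is the same as in the real spider: give the extraction method a different name (parse_item here) and leave parse alone.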

As for the crawling itself, that's what the rules class attribute is for. I haven't tested it, but it should work:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']    
    start_urls = ['http://www.kickstarter.com/discover/recently-launched']

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        )
    )

    def parse_item(self, response):
        # The loader runs the XPath queries against the response itself.
        loader = XPathItemLoader(item=kickstarteritem(), response=response)

        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')

        yield loader.load_item()

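The spider starts at the recently launched listing. The first rule follows its pagination links (?page=N) without parsing them, and the second rule sends every link containing /projects/ to parse_item.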
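The two allow patterns can be checked offline with the standard re module before running a crawl. The sketch below assumes that link extractors test allow patterns with a regex search against the absolute URL and that a link is handled by the first rule that matches it; both are my understanding rather than something verified here. The /comments and /help URLs are made up for illustration, and PROJECT_STRICT is a hypothetical tighter pattern, not part of the answer above.

```python
import re

PAGINATION = re.compile(r'\?page=\d+')   # first rule: follow only
PROJECT = re.compile(r'/projects/')      # second rule: callback='parse_item'
# Hypothetical stricter pattern: exactly /projects/<creator>/<slug>, no sub-pages.
PROJECT_STRICT = re.compile(r'/projects/[^/?#]+/[^/?#]+/?$')

sample_urls = [
    'http://www.kickstarter.com/discover/recently-launched?page=2',
    'http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic',
    # The next two URLs are invented to show the edge cases.
    'http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic/comments',
    'http://www.kickstarter.com/help',
]


def classify(url):
    """Return which rule would claim the URL, checking the rules in order."""
    if PAGINATION.search(url):
        return 'follow'
    if PROJECT.search(url):
        return 'parse_item'
    return 'ignored'


for url in sample_urls:
    print(classify(url), bool(PROJECT_STRICT.search(url)), url)
```

Note that the loose /projects/ pattern also claims the invented /comments sub-page, so parse_item would run on pages that lack the project fields. If that turns out to matter on the real site, a stricter pattern along the lines of PROJECT_STRICT is one way to narrow it.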

Also, prefer yield to return. Keeping the callback a generator lets you emit multiple items or requests without building a list to hold them.
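The difference is easiest to see in plain Python, outside Scrapy. In this sketch the function names are hypothetical and plain dicts stand in for items; both versions produce the same results, but the generator hands each one over as soon as it is ready.

```python
def parse_with_return(urls):
    """Builds the whole list first, then returns it in one go."""
    items = []
    for url in urls:
        items.append({'url': url})
    return items


def parse_with_yield(urls):
    """Generator version: each item is handed over as it is produced."""
    for url in urls:
        yield {'url': url}


urls = ['/projects/a/one', '/projects/b/two']  # made-up paths

gen = parse_with_yield(urls)
print(next(gen))   # first item is available before the second is built
print(list(gen))   # the remaining items
print(parse_with_return(urls))
```

The generator form also makes it natural to mix outputs, for example yielding an item and then a follow-up request from the same callback.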

Blender answered Sep 21 '22 18:09