I would like to parse kickstarter.com projects using scrapy, but can't figure out how to make the spider search projects that I don't explicitly specify under start_urls. I have the first part of the scrapy code figured out (I can extract the necessary information from one website), I just can't get it to do this for all projects under the domain kickstarter.com/projects.
From what I've read, I believe that parsing is possible (1) using links on the starting page (kickstarter.com/projects), (2) using links from one project page to jump to another project, and (3) using a site map (which I don't think kickstarter.com has) to locate webpages to parse.
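To make (2) concrete, jumping from one project page to another boils down to yielding new Requests from a callback; a minimal sketch of what I mean (the XPath is a placeholder, not Kickstarter's actual markup):

from urlparse import urljoin  # Python 2, matching this Scrapy era

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    x = HtmlXPathSelector(response)
    # ... extract this project's fields into an item and yield it ...
    # Approach (2): queue every other project linked from this page.
    # The XPath below is a placeholder; inspect the real page markup.
    for href in x.select('//a[contains(@href, "/projects/")]/@href').extract():
        yield Request(urljoin(response.url, href), callback=self.parse)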
I've spent hours trying each of these methods, but I am getting nowhere.
I've used the scrapy tutorial code and built on it.
Here is the part that works so far:
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic"]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        item = kickstarteritem()
        item['url'] = response.url
        item['name'] = x.select("//div[@class='NS-project_-running_board']/h2[@id='title']/a/text()").extract()
        item['launched'] = x.select("//li[@class='posted']/text()").extract()
        item['ended'] = x.select("//li[@class='ends']/text()").extract()
        item['backers'] = x.select("//span[@class='count']/data[@data-format='number']/@data-value").extract()
        item['pledge'] = x.select("//div[@class='num']/@data-pledged").extract()
        item['goal'] = x.select("//div[@class='num']/@data-goal").extract()
        return item
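For reference, the kickstarteritem class imported above needs a Field declared for each key assigned in parse; a minimal sketch of what tutorial/items.py would look like:

from scrapy.item import Item, Field

class kickstarteritem(Item):
    # one Field per key assigned in the spider
    url = Field()
    name = Field()
    launched = Field()
    ended = Field()
    backers = Field()
    pledge = Field()
    goal = Field()

Running scrapy crawl kickstarter then scrapes just the single project page listed in start_urls.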
Since you're subclassing CrawlSpider, do not override parse. CrawlSpider's link-crawling logic is contained within parse, which you really need.
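If you do need custom handling for the start URLs themselves, CrawlSpider provides the parse_start_url hook for exactly that, so the built-in parse stays intact; a minimal sketch:

from scrapy.contrib.spiders import CrawlSpider

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    # allowed_domains, start_urls and rules as in the snippet below

    def parse_start_url(self, response):
        # CrawlSpider's built-in parse() calls this hook for each
        # response from start_urls, so the link-following machinery
        # keeps working; return items/requests here if you need them.
        return []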
As for the crawling itself, that's what the rules class attribute is for. I haven't tested it, but it should work:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/recently-launched']

    rules = (
        # Follow the pagination links on the discover listing.
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        # Hand every project page to parse_item.
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        )
    )

    def parse_item(self, response):
        loader = XPathItemLoader(item=kickstarteritem(), response=response)
        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')
        yield loader.load_item()
The spider crawls the pages of the recently launched projects.
Also, use yield instead of return. It's better to keep your spider's output a generator, and it lets you yield multiple items/requests without making a list to hold them.
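For instance, a generator callback can emit its item and then schedule follow-up requests in the same pass; a sketch of such a spider method, reusing the loader and item imports above (the rel="next" XPath is hypothetical):

from urlparse import urljoin  # Python 2, matching this Scrapy era

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    loader = XPathItemLoader(item=kickstarteritem(), response=response)
    loader.add_value('url', response.url)
    yield loader.load_item()  # emit the item first...
    # ...then any number of follow-up requests, no list required
    next_pages = HtmlXPathSelector(response).select('//a[@rel="next"]/@href').extract()
    for href in next_pages:
        yield Request(urljoin(response.url, href), callback=self.parse_item)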