
Python's Scrapy doesn't seem to get data from all available URLs

I'm trying to scrape thesession.org to build a table of how many times each tune has been added to members' tunebooks, so I can find some popular pieces to learn. I started with the Scrapy tutorial here and am trying to modify it to suit my purposes. The problem is that although thesession.org appears to have some 10,390 tunes, my scraper only returns data on 10 of them (only the ones linked from http://www.thesession.org/tunes/index.php). How can I get data on all the tunes (or the hundred top-ranked tunes)? Any advice would be greatly appreciated.

Here's what I've got so far:

items.py

from scrapy.item import Item, Field

class tuneItem(Item):
    url = Field()
    name1 = Field()
    name2 = Field()
    key = Field()
    count = Field()

tune_spider.py

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings

class tunesSpider(CrawlSpider):

    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    # Raw strings keep the regex backslashes (\d) from being read as string escapes.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/display/\d+'], deny=[r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']), 'parse_tune')]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)

        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key']   = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
        return tune

I run the scraper by opening my console, going to the directory containing the tutorial's cfg file, and running scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv

Here is what I get:

C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
        {'count': [u'1'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Brendan Begley's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
        {'count': [u'3'],
         'key': [u'Key signature: Amajor'],
         'name1': [u'Carleton County Breakdown'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
        {'count': [u'3'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Kasper's Rant"],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
        {'count': [u'5'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'The Full Of The Bag'],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
        {'count': [u'1'],
         'key': [u'Key signature: Adorian'],
         'name1': [u'The New Steamboat'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
        {'count': [u'4'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u"Galen's Arrival"],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
        {'count': [u'2'],
         'key': [u'Key signature: Amixolydian'],
         'name1': [u'Culloden Day'],
         'name2': [u'strathspey'],
         'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
        {'count': [u'2'],
         'key': [u'Key signature: Aminor'],
         'name1': [u'Miss Sine Flemington'],
         'name2': [u'barndance'],
         'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
        {'count': [u'2'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Joan Martin's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
        {'count': [u'2'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'My Time Inside 2005'],
         'name2': [u'waltz'],
         'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
        {'downloader/request_bytes': 3655,
         'downloader/request_count': 12,
         'downloader/request_method_count/GET': 12,
         'downloader/response_bytes': 31620,
         'downloader/response_count': 12,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
         'item_scraped_count': 10,
         'request_depth_max': 1,
         'scheduler/memory_enqueued': 12,
         'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
        {}

EDIT: The answer from @reclosedev got me on the way. For anyone wondering about the outcome, here's a snapshot...

(1) The vast majority of tunes are in fewer than 10 members' tunebooks


(2) The popularity of all 10,379 tunes that I could scrape from the site (as measured by how many tunebooks they are in) follows a power-law distribution


(3) And here are the tunes that are in >1000 tunebooks on the site, showing the names of the top-ranked tunes and how many tunebooks they are in

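For anyone who wants to reproduce a ranking like the one in (3), here is a minimal sketch of how the exported CSV could be sorted by tunebook count. It assumes the feed has a header row with the item field names (count, key, name1, name2, url); the column order and the helper name top_tunes are assumptions for illustration, and the sample rows are taken from the log output above rather than from the full data set.

```python
import csv
import io

# Stand-in for scraped_data.csv. The header and column order are assumptions;
# the three rows reuse values that appear in the crawl log above.
SAMPLE_CSV = """count,key,name1,name2,url
1,Key signature: Dmajor,Brendan Begley's,polka,http://www.thesession.org/tunes/display/11602
5,Key signature: Gmajor,The Full Of The Bag,hornpipe,http://www.thesession.org/tunes/display/11594
4,Key signature: Gmajor,Galen's Arrival,reel,http://www.thesession.org/tunes/display/11598
"""


def top_tunes(csv_file, n=100):
    """Return up to n (count, name, type) tuples, highest count first."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            count = int(row.get('count'))
        except (TypeError, ValueError):
            # Skip rows where the count is missing or not a number.
            continue
        rows.append((count, row.get('name1'), row.get('name2')))
    rows.sort(key=lambda r: r[0], reverse=True)
    return rows[:n]


ranked = top_tunes(io.StringIO(SAMPLE_CSV), n=2)
for count, name, tune_type in ranked:
    print(count, name, tune_type)
```

With the real file, the same function could be called as top_tunes(open('scraped_data.csv', newline=''), n=100).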

Ben asked Nov 26 '11 06:11



1 Answer

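Your only rule matches the tune pages linked from the start page, so the crawl stops after those 10 tunes. You need to add a second Rule that extracts the links to the paginated index pages, so that the spider follows them and finds the tune links on each one: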

rules = [
    ...,  # your existing parse_tune rule
    Rule(
        SgmlLinkExtractor(
            # raw string so that \? and \d reach the regex engine unchanged
            allow=(r'/index/new\?new_start=\d+',)
        ),
        follow=True,
    ),
]

edit:

follow=True is not strictly necessary here: when a Rule has no callback (callback=None), follow defaults to True.
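To see which links each rule would pick up, the patterns can be tried on a few URLs with plain re, without running a crawl. This is only a rough sketch: it assumes the link extractor applies its allow and deny patterns with search semantics (a match anywhere in the URL), that the first rule matching a link handles it, and the sample URLs are hypothetical ones shaped after those in the question and in the pattern above. The helper names matches and classify are made up for illustration and are not part of Scrapy.

```python
import re

# Patterns copied from the two rules, written as raw strings.
TUNE_ALLOW = [r'/display/\d+']
TUNE_DENY = [r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']
PAGE_ALLOW = [r'/index/new\?new_start=\d+']


def matches(url, patterns):
    """Return True if any pattern is found anywhere in the URL."""
    return any(re.search(p, url) for p in patterns)


def classify(url):
    """Say which rule, if any, would pick up this link (first match wins)."""
    if matches(url, TUNE_ALLOW) and not matches(url, TUNE_DENY):
        return 'parse_tune'
    if matches(url, PAGE_ALLOW):
        return 'follow'
    return 'ignored'


# Hypothetical sample URLs; only the first one appears in the log above.
samples = [
    'http://www.thesession.org/tunes/display/11602',
    'http://www.thesession.org/tunes/display/11602/comments',
    'http://www.thesession.org/tunes/index/new?new_start=10',
    'http://www.thesession.org/members/display/1',
]
results = {url: classify(url) for url in samples}
for url, verdict in results.items():
    print(verdict, url)
```

Note that the '/index/' entry in the first rule's deny list does not block the pagination links, because each rule has its own extractor and the deny list only applies to the rule it belongs to.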

reclosedev answered Oct 19 '22 20:10
