 

Crawling multiple sites with Python Scrapy with limited depth per site

I am new to Scrapy and I am trying to crawl multiple sites from a text file with CrawlSpider. I would like to limit the depth of the crawl per site, and also the total number of crawled pages, again per site. Unfortunately, when the start_urls and allowed_domains attributes are set in __init__, response.meta['depth'] always seems to be zero (this doesn't happen when I scrape individual sites). Setting DEPTH_LIMIT in the settings file doesn't seem to do anything at all. When I remove the __init__ definition and simply set start_urls and allowed_domains as class attributes, things seem to work fine. Here is the code:

from urlparse import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, urls_file, N=10):
        data = open(urls_file, 'r').readlines()[:N]
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        print response.url
        print response.meta['depth']

This results in response.meta['depth'] always being equal to zero, and the crawler only crawls the first page of each site in start_urls (i.e. it doesn't follow any links). So I have two questions: 1) how to limit the crawl to a certain depth per site in start_urls, and 2) how to limit the total number of crawled pages per site, irrespective of depth.
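
For reference, this is a sketch of what I set in settings.py (DEPTH_LIMIT as mentioned above; CLOSESPIDER_PAGECOUNT is a related setting I found that caps the total number of pages, though as far as I can tell it applies to the whole crawl rather than per site):

# settings.py
DEPTH_LIMIT = 2              # maximum link depth to follow
CLOSESPIDER_PAGECOUNT = 100  # stop the spider after this many responses (CloseSpider extension)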

Thanks!

asked Apr 06 '13 by gpanterov

1 Answer

Don't forget to call the base class constructor (for example with super):

def __init__(self, urls_file, N=10, *a, **kw):
    data = open(urls_file, 'r').readlines()[:N]
    self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
    self.start_urls = ['http://' + domain for domain in self.allowed_domains]
    super(DownloadSpider, self).__init__(*a, **kw)  # let CrawlSpider set itself up

UPDATE:

When you override a method in Python, the base class method is no longer called; your new method is called instead. This means that if you want your new logic to run in addition to the old logic (rather than instead of it), you need to call the old logic manually.
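
A minimal illustration of the principle (plain Python, nothing Scrapy-specific):

class Base(object):
    def __init__(self):
        self.ready = True  # set-up logic that lives in the base class

class Child(Base):
    def __init__(self):
        # without this call, Base.__init__ never runs and self.ready is never set
        super(Child, self).__init__()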

Here is the logic you were missing by not calling CrawlSpider.__init__() (via super(DownloadSpider, self).__init__()):

self._compile_rules()
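
Putting it together, a sketch of the corrected spider (the import paths assume a Scrapy version from that era, and the file handling is tidied up slightly):

from urlparse import urlparse  # Python 2 stdlib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, urls_file, N=10, *a, **kw):
        with open(urls_file, 'r') as f:
            data = f.readlines()[:N]
        self.allowed_domains = [urlparse(line).hostname.strip() for line in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(DownloadSpider, self).__init__(*a, **kw)  # compiles the rules

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        print response.url
        print response.meta['depth']  # increments once links are actually followed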
answered Nov 04 '22 by Steven Almeroth