I am trying to build a spider that can efficiently scrape text from many websites. Since I am a Python user, I was referred to Scrapy. However, to avoid scraping huge websites, I want to limit the spider to no more than 20 pages up to a certain "depth" per website. Here is my spider:
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/MyProjects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        # read one URL per line and derive the allowed domains / start URLs
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # append each crawled URL to a per-site text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')
urls_file is a path to a text file with URLs. I have also set the max depth in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT setting, it closes the spider when the total number of scraped pages (regardless of which site they came from) reaches that value. However, I need to stop scraping when I have scraped, say, 20 pages from each URL.
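For reference, the relevant settings look roughly like this (the values are illustrative):

# settings.py (illustrative values)
DEPTH_LIMIT = 3              # the max depth mentioned above
CLOSESPIDER_PAGECOUNT = 20   # closes the spider after 20 pages in total, not per site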
I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems that Scrapy doesn't go URL by URL but mixes them up.
Any advice is much appreciated!
To do this, you can create your own link extractor class based on SgmlLinkExtractor. It should look something like this:
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedLinkExtractor(SgmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None,
                 deny_extensions=None, max_pages=20):
        self.max_pages = max_pages
        SgmlLinkExtractor.__init__(self, allow=allow, deny=deny, allow_domains=allow_domains,
                                   deny_domains=deny_domains, restrict_xpaths=restrict_xpaths,
                                   tags=tags, attrs=attrs, canonicalize=canonicalize, unique=unique,
                                   process_value=process_value, deny_extensions=deny_extensions)

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            sel = Selector(response)
            base_url = get_base_url(response)
            body = u''.join(f
                            for x in self.restrict_xpaths
                            for f in sel.xpath(x).extract()
                            ).encode(response.encoding, errors='xmlcharrefreplace')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        # keep only the first max_pages links
        links = links[0:self.max_pages]
        return links
The code of this subclass is based entirely on the code of SgmlLinkExtractor. I've just added the max_pages variable to the constructor and a line that trims the list of links at the end of the extract_links method. But you could trim this list in a more intelligent way.
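You would then use it in place of the default extractor in the spider's rules -- a sketch, assuming the class above is saved somewhere importable (the module path below is hypothetical):

# hypothetical module path -- adjust the import to wherever you put the class
from myproject.linkextractors import LimitedLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    # extract at most 20 links from each page instead of all of them
    rules = (Rule(LimitedLinkExtractor(max_pages=20), callback='parse_item', follow=True),)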
I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] (or maybe the key could be a tuple like (website, depth) in your case) in parse_item.
This is how I imagine it - it should work in theory. Let me know if you need an example.
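A rough sketch of what I mean (untested; here I key the counter by hostname, and note that this only skips the extra pages in parse_item rather than stopping the crawl of that domain):

from collections import defaultdict
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider


class DownloadSpider(CrawlSpider):
    download_path = '/home/MyProjects/crawler'
    # per-class counter of pages seen per site
    stats = defaultdict(int)

    # ... name, rules and __init__ as in the question ...

    def parse_item(self, response):
        site = urlparse(response.url).hostname.strip()
        self.stats[site] += 1
        if self.stats[site] > 20:
            return  # already handled 20 pages from this site, ignore the rest
        fname = self.download_path + site
        with open(fname + '.txt', 'a') as f:
            f.write(response.url + '\n')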
FYI, you can extract the base URL and calculate the depth with the help of urlparse.urlparse (see the docs).
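For instance (counting path segments is just one possible way to interpret "depth"):

from urlparse import urlparse

parsed = urlparse('http://example.com/section/page/article.html')
base = parsed.hostname                                       # 'example.com'
depth = len([seg for seg in parsed.path.split('/') if seg])  # 3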