I am trying to build a spider that can efficiently scrape text from many websites. Since I am a Python user, I was referred to Scrapy. However, to avoid scraping huge websites, I want to limit the spider to no more than 20 pages up to a certain "depth" per website. Here is my spider:
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/MyProjects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        # read one URL per line and derive the allowed domains / start URLs
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # append each crawled URL to a per-site text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')
urls_file is a path to a text file with URLs. I have also set the max depth in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT setting, it closes the spider when the total number of scraped pages (regardless of which site they came from) reaches that value. However, I need to stop scraping when I have scraped, say, 20 pages from each URL.
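For reference, the relevant settings look roughly like this (the values are illustrative):

# settings.py (illustrative values)
DEPTH_LIMIT = 3              # the max depth mentioned above
CLOSESPIDER_PAGECOUNT = 20   # closes the spider after 20 pages in total, not per site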
I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems that Scrapy doesn't go URL by URL but mixes them up.
Any advice is much appreciated!
To do this, you can create your own link extractor class based on SgmlLinkExtractor. It should look something like this:
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedLinkExtractor(SgmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None,
                 deny_extensions=None, max_pages=20):
        self.max_pages = max_pages
        SgmlLinkExtractor.__init__(self, allow=allow, deny=deny, allow_domains=allow_domains,
                                   deny_domains=deny_domains, restrict_xpaths=restrict_xpaths,
                                   tags=tags, attrs=attrs, canonicalize=canonicalize, unique=unique,
                                   process_value=process_value, deny_extensions=deny_extensions)

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            sel = Selector(response)
            base_url = get_base_url(response)
            body = u''.join(f
                            for x in self.restrict_xpaths
                            for f in sel.xpath(x).extract()
                            ).encode(response.encoding, errors='xmlcharrefreplace')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        # keep only the first max_pages links
        links = links[0:self.max_pages]
        return links
The code of this subclass is based entirely on the code of SgmlLinkExtractor. I've just added the max_pages variable to the constructor and a line that trims the list of links at the end of the extract_links method. But you could trim this list in a more intelligent way.
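You would then use it in place of the default extractor in the spider's rules -- a sketch, assuming the class above is saved somewhere importable (the module path below is hypothetical):

# hypothetical module path -- adjust the import to wherever you put the class
from myproject.linkextractors import LimitedLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    # extract at most 20 links from each page instead of all of them
    rules = (Rule(LimitedLinkExtractor(max_pages=20), callback='parse_item', follow=True),)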
I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] (or maybe the key could be a tuple like (website, depth) in your case) in parse_item.
This is how I imagine it - it should work in theory. Let me know if you need an example.
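A rough sketch of what I mean (untested; here I key the counter by hostname, and note that this only skips the extra pages in parse_item rather than stopping the crawl of that domain):

from collections import defaultdict
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider


class DownloadSpider(CrawlSpider):
    download_path = '/home/MyProjects/crawler'
    # per-class counter of pages seen per site
    stats = defaultdict(int)

    # ... name, rules and __init__ as in the question ...

    def parse_item(self, response):
        site = urlparse(response.url).hostname.strip()
        self.stats[site] += 1
        if self.stats[site] > 20:
            return  # already handled 20 pages from this site, ignore the rest
        fname = self.download_path + site
        with open(fname + '.txt', 'a') as f:
            f.write(response.url + '\n')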
FYI, you can extract the base URL and calculate the depth with the help of urlparse.urlparse (see the docs).
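For instance (counting path segments is just one possible way to interpret "depth"):

from urlparse import urlparse

parsed = urlparse('http://example.com/section/page/article.html')
base = parsed.hostname                                       # 'example.com'
depth = len([seg for seg in parsed.path.split('/') if seg])  # 3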