I am writing a Scrapy spider that crawls a set of URLs once per day. However, some of these websites are very big, so I cannot crawl the full site daily, nor would I want to generate the massive traffic necessary to do so.
An old question (here) asked something similar. However, the upvoted response simply points to a code snippet (here), which appears to require something to be set on the request instance, though that is not explained in the response or on the page containing the code snippet.
I'm trying to make sense of this but find middleware a bit confusing. A complete example of a scraper that can be run multiple times without re-scraping URLs would be very useful, whether or not it uses the linked middleware.
I've posted code below to get the ball rolling, but I don't necessarily need to use this middleware. Any Scrapy spider that can crawl daily and extract only new URLs will do. Obviously, one solution is to just write out a dictionary of scraped URLs and then check whether each new URL is already in it, but that seems very slow/inefficient.
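(For what it's worth, here is roughly what I mean by the dictionary-of-URLs idea — a minimal sketch, where seen_urls.json is just a name I made up for the persisted file:)

import json
import os

SEEN_FILE = "seen_urls.json"  # made-up path for the persisted URL set

def load_seen_urls():
    """Return the set of URLs scraped on previous runs (empty on the first run)."""
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def save_seen_urls(seen):
    """Write the accumulated URL set back to disk for the next run."""
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)

# In the spider, roughly:
#   self.seen_urls = load_seen_urls()          # when the spider starts
#   if response.url in self.seen_urls: return  # at the top of parse_item
#   self.seen_urls.add(response.url)           # after yielding the item
#   save_seen_urls(self.seen_urls)             # from a spider_closed handler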
Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item
Items
import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()
Middlewares (ignore.py)
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
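From what I can tell, the snippet needs two things that aren't spelled out anywhere: it has to be registered as a spider middleware, and the requests it should filter need meta['filter_visited'] set (plus the spider needs a context dict, or the visited ids the middleware collects are thrown away each call). A rough sketch of that wiring — the module path and the ordering number are my guesses:

# settings.py -- register the spider middleware; adjust the dotted path to
# wherever ignore.py actually lives (this path is a guess), and the number
# is just a middleware ordering value.
SPIDER_MIDDLEWARES = {
    "cnn_scrapy.middlewares.ignore.IgnoreVisitedItems": 543,
}

# In the spider module: flag outgoing requests so the middleware filters them.
def mark_filter_visited(request, response=None):
    # Newer Scrapy versions also pass the response to Rule.process_request,
    # hence the optional second argument.
    request.meta["filter_visited"] = True
    return request

class NewspaperSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and parse_item as above ...
    context = {}  # the middleware reads/writes its visited ids via spider.context

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_request=mark_filter_visited),
    )

Even then, spider.context only lives in memory for a single run, so a daily crawl would still need to persist it (or something like the seen-URLs file above) between runs.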
Here's the thing: what you want is a single database that your scheduled/cronned crawl works against. With or without the dupe-filter middleware, you're still crawling the entire site on every run... and even though the code you posted obviously isn't the whole project, it feels like way too much code for this.
I'm not exactly sure what you're scraping, but I'm going to assume CNN is the project's target URL and that you're after articles?
What I would do is use CNN's RSS feeds, or even the sitemap, since those include the publication date in the article metadata, together with the os module, and then:
- Define the date for each crawl instance.
- Using a regex, restrict itemization by checking the crawl's defined date against the date each article was posted.
- Deploy and schedule the crawl on Scrapinghub.
- Use Scrapinghub's Python API client to iterate through the items (see the sketch below).
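For that last step, a rough sketch of pulling the items back out with the scrapinghub client — the API key and job key below are placeholders:

from scrapinghub import ScrapinghubClient  # pip install scrapinghub

# Placeholder credentials and job key -- substitute your own project's values.
client = ScrapinghubClient("YOUR_API_KEY")
job = client.get_job("123456/1/8")  # <project id>/<spider id>/<job id>

for item in job.items.iter():
    print(item["url"])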
It would still be crawling the site's content, but an XMLFeedSpider or RSS-style spider class is perfect for parsing all of that data much more quickly... and now that the database lives in the "cloud", I feel the project could scale more modularly, with much easier portability/cross-compatibility.
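If it helps, here's a minimal sketch of that feed-based idea, assuming a reasonably recent Scrapy (for XMLFeedSpider and .get()) and a placeholder CNN feed URL — swap in whichever feeds or sitemaps you actually follow:

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

from scrapy.spiders import XMLFeedSpider

from cnn_scrapy.items import NewspaperItem


class NewspaperFeedSpider(XMLFeedSpider):
    name = "newspaper_feed"
    allowed_domains = ["cnn.com"]
    # Placeholder feed URL -- use whichever CNN feeds you care about.
    start_urls = ["http://rss.cnn.com/rss/cnn_topstories.rss"]
    itertag = "item"  # iterate over each <item> element in the RSS feed

    def parse_node(self, response, node):
        pub_date_str = node.xpath("pubDate/text()").get()
        if not pub_date_str:
            return None
        # RSS pubDate is RFC 822; keep only articles newer than 24 hours,
        # which matches a once-a-day crawl schedule.
        pub_date = parsedate_to_datetime(pub_date_str)
        if pub_date < datetime.now(timezone.utc) - timedelta(days=1):
            return None
        item = NewspaperItem()
        item["url"] = node.xpath("link/text()").get()
        return item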
I'm sure the flow I'm describing would need some tinkering, but the idea is straightforward.