Scrapy spider that only crawls URLs once

I am writing a Scrapy spider that crawls a set of URLs once per day. However, some of these websites are very big, so I cannot crawl the full site daily, nor would I want to generate the massive traffic necessary to do so.

An old question (here) asked something similar. However, the upvoted response simply points to a code snippet (here), which seems to require something of the request instance, though that is explained neither in the response nor on the page containing the snippet.

I'm trying to make sense of this, but find middleware a bit confusing. A complete example of a scraper that can be run multiple times without re-scraping URLs would be very useful, whether or not it uses the linked middleware.

I've posted code below to get the ball rolling, but I don't necessarily need to use this middleware. Any Scrapy spider that can crawl daily and extract only new URLs will do. Obviously one solution is to write out a dictionary of scraped URLs and then check whether each new URL is or isn't in the dictionary, but that seems very slow/inefficient.
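One built-in alternative to hand-rolled bookkeeping worth noting: Scrapy's job persistence. If JOBDIR points at the same directory on every run, the duplicate filter's set of seen request fingerprints is kept on disk, so URLs crawled on earlier runs are dropped as duplicates on later ones. A minimal sketch follows (this is really Scrapy's pause/resume support, but the persisted dupefilter covers the "already seen" part; the directory name is arbitrary):

# settings.py -- or pass it per run on the command line instead:
#   scrapy crawl newspaper -s JOBDIR=crawls/newspaper-1
JOBDIR = "crawls/newspaper-1"  # reuse the same directory on every daily run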

Spider

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem



class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item

Items

import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()

Middlewares (ignore.py)

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem

class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
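
For what it's worth, here is a hedged sketch of how this snippet might be wired up; the module path (cnn_scrapy.middlewares.ignore) and the priority number are assumptions, and note that the middleware keeps visited_ids in an in-memory spider.context dict, so that state is not by itself persisted between separate daily runs.

# settings.py -- register the spider middleware (path and priority are guesses)
SPIDER_MIDDLEWARES = {
    "cnn_scrapy.middlewares.ignore.IgnoreVisitedItems": 543,
}

The middleware only acts on requests that carry meta['filter_visited'], so the spider has to set that flag. With a CrawlSpider, the Rule's process_request hook is one place to do it (in Scrapy 1.x it receives just the request; newer releases also pass the response):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


def flag_for_filtering(request):
    # Mark the request so IgnoreVisitedItems considers it for de-duplication.
    request.meta["filter_visited"] = True
    return request


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_request=flag_for_filtering),
    )

    def parse_item(self, response):
        item = NewspaperItem()
        item["url"] = response.url
        yield item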
asked Jun 10 '16 by Henry David Thorough




1 Answer

Here's the thing: what you want is a single database that your scheduled/cron'd crawl feeds into. With or without the dupe-filter middleware, you're still having to hit the entire site regardless... and even though the code you posted obviously can't be the whole project, that is already way too much code.

I'm not exactly sure what you're scraping, but I'm going to assume the project targets CNN and that you're after articles?

What I would do is use CNN's RSS feeds, or even its sitemap, since those provide publication dates along with the article metadata, and then, with help from the os module:

- Define the date for each crawl instance.
- Use a regex to restrict itemization to articles whose posted date matches the crawl's defined date.
- Deploy and schedule the crawl on Scrapinghub.
- Use Scrapinghub's Python API client to iterate through the items.

You would still be crawling the whole feed's content, but an XMLFeedSpider (or an RSS-oriented spider) class is perfect for parsing all that data much more quickly. And now that the database lives in a "cloud", I feel the project could be more modular and scalable, as well as much easier to port and keep cross-compatible.
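
For illustration only, a rough sketch of that feed-based approach; the feed URL (http://rss.cnn.com/rss/cnn_topstories.rss), spider name, and date handling are assumptions, not part of the answer. It keeps just the items whose pubDate falls on the current UTC day:

from datetime import date, datetime
from email.utils import parsedate  # stdlib RFC 822 date parser, matches RSS pubDate

from scrapy.spiders import XMLFeedSpider  # scrapy.contrib.spiders in older releases

from cnn_scrapy.items import NewspaperItem


class NewspaperFeedSpider(XMLFeedSpider):
    name = "newspaper_feed"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://rss.cnn.com/rss/cnn_topstories.rss"]  # assumed feed URL
    itertag = "item"  # iterate over each RSS <item> node

    def parse_node(self, response, node):
        # <pubDate> looks like "Fri, 10 Jun 2016 01:00:00 GMT"
        pub_date = node.xpath("pubDate/text()").extract_first()
        parsed = parsedate(pub_date) if pub_date else None
        if not parsed or date(*parsed[:3]) != datetime.utcnow().date():
            return  # skip anything not posted today
        item = NewspaperItem()
        item["url"] = node.xpath("link/text()").extract_first()
        yield item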

I'm sure the flow I'm describing would need some tinkering, but the idea is straightforward.

answered Oct 10 '22 by scriptso