
Stop Scrapy after N items scraped

Tags: python, scrapy

I'm having trouble with Scrapy. I need code that will scrape up to 1000 internal links per given URL. My code runs from the command line, but the spider doesn't stop; it only logs the CloseSpider message and keeps crawling.

My code is as follows:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
# the CloseSpider exception lives in scrapy.exceptions;
# scrapy.contrib.closespider only holds the extension of the same name
from scrapy.exceptions import CloseSpider

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'testspider1'
    allowed_domains = ['angieslist.com']
    start_urls = ['http://www.angieslist.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url

        # item_scraped_count is unset until the first item has gone through
        # the pipeline, so fall back to 0
        scrape_count = self.crawler.stats.get_value('item_scraped_count', 0)
        print scrape_count

        limit = 10  # small value for testing; the real target is 1000

        # >= rather than ==: with concurrent requests the counter can skip
        # past the exact limit and an equality test would never fire
        if scrape_count >= limit:
            raise CloseSpider('Limit Reached')

        return item
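
Note for anyone on a newer Scrapy: the scrapy.contrib paths above were deprecated and later removed. As a sketch, assuming nothing else in the spider changes, the equivalent imports today are:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces SgmlLinkExtractor
from scrapy.item import Item, Field
from scrapy.exceptions import CloseSpider

and the rule becomes Rule(LinkExtractor(), callback='parse_url', follow=True).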
asked Jul 06 '15 by Josh Usre




1 Answer

My problem was trying to apply the close-spider limit in the wrong place. It's a setting that belongs in the settings.py file. When I set it there manually, or passed it as an argument on the command line, it worked (stopping within 10-20 of N, for what it's worth).

settings.py:

BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 1000
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'yo mama'
LOG_LEVEL = 'DEBUG'
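
For what it's worth, CLOSESPIDER_PAGECOUNT counts crawled responses rather than scraped items; the same CloseSpider extension also supports CLOSESPIDER_ITEMCOUNT, which matches the "stop after N items" goal more directly. Either one can be set for a single run on the command line, e.g. scrapy crawl testspider1 -s CLOSESPIDER_ITEMCOUNT=1000, instead of editing settings.py. Newer Scrapy versions also accept a per-spider custom_settings class attribute; a minimal, untested sketch using the spider name from the question:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'testspider1'
    # per-spider override; the CloseSpider extension closes the crawl once
    # this many items have been scraped (in-flight requests still finish,
    # which is why it stops near N rather than exactly at N)
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 1000}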
answered Sep 27 '22 by Josh Usre