
How to stop a Scrapy spider after a certain number of requests?

I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper and it keeps on scraping, which I don't want. I want to increase the counter value and stop after 100 posts. But the 9gag page is designed so that each response gives only 10 posts, and after each iteration my counter value resets to 10; because of this my loop runs infinitely and never stops.


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None
    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count +=1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item


                else:
                    break


        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse) 
        print count

The code for items.py is here:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want to increase a global count value, and I tried this by passing 3 arguments to the parse function, but it gives an error:

TypeError: parse() takes exactly 3 arguments (2 given)

So is there a way to pass a global count value, return it after each iteration, and stop after 100 posts (say)?

The entire project is available here on GitHub. Even if I set POST_LIMIT=100 the infinite loop happens; this is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json
asked Mar 02 '16 by SaiKiran

People also ask

How do you close a Scrapy spider?

To force a spider to close you can raise the CloseSpider exception, as described in the Scrapy docs. Just be sure to return/yield your items before you raise the exception.
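For example, a minimal sketch (the spider name, URL and the 100-post cap here are illustrative, not taken from the question):

import scrapy
from scrapy.exceptions import CloseSpider


class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ('http://www.9gag.com/', )
    count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            self.count += 1
            if self.count > 100:
                # Stops the whole spider; items yielded before this point are kept.
                raise CloseSpider('Reached 100 posts')
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}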

How do you stop a Scrapy shell?

Finally, you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling.

How do you slow down a Scrapy?

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But Scrapy also has a feature to set download delays automatically, called AutoThrottle. It sets delays based on the load of both the Scrapy server and the website you are crawling.
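Both options are plain settings; a sketch of what they might look like in settings.py:

# Fixed delay: wait about one second between requests.
DOWNLOAD_DELAY = 1

# Or let Scrapy adjust the delay itself based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0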

What is concurrent requests in Scrapy?

In the context of Scrapy, this means sending out "concurrent" requests instead of sending them one by one. In other words, the Scrapy spider will send X (simultaneous) requests to the web server at the same time.
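These limits are also ordinary settings; a sketch (the values shown are Scrapy's defaults):

CONCURRENT_REQUESTS = 16            # global number of requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain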


2 Answers

There's a built-in setting CLOSESPIDER_PAGECOUNT that can be passed via the command-line -s argument or changed in settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat is that if you've enabled caching, it will count cache hits as page counts as well.
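If you prefer not to pass it on the command line every time, the same limit can live in the spider's custom_settings; a minimal sketch:

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    custom_settings = {
        # Close the spider after roughly 100 responses have been crawled.
        'CLOSESPIDER_PAGECOUNT': 100,
    }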

answered Oct 04 '22 by Dennis


First: use self.count and initialize it outside of parse. Then don't prevent the parsing of the items, but the generation of new requests. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # Only request the next page while still under the limit; this is what stops the crawl.
        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
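If you want the cap to come from the command line like the POST_LIMIT in the question, one possible sketch is to read it from the settings when the crawl starts; note that POST_LIMIT is a custom name from the question, not a built-in Scrapy setting:

    def start_requests(self):
        # Read -s POST_LIMIT=... if given, otherwise fall back to 30.
        self.COUNT_MAX = self.settings.getint('POST_LIMIT', 30)
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

and then run it as scrapy crawl first -s POST_LIMIT=100 --output=output.json.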
answered Oct 04 '22 by Frank Martin