I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper and it keeps on scraping, which I don't want. I want to increment a counter value and stop after 100 posts. But the 9gag page is designed so that each response gives only 10 posts, and after each iteration my counter value resets to 10; in this case my loop runs infinitely long and never stops.
# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count
The code for items.py is here:
from scrapy.item import Item, Field

class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increment a global count value, and tried this by passing 3 arguments to the parse function, but it gives an error:
TypeError: parse() takes exactly 3 arguments (2 given)
So is there a way to pass a global count value and return it after each iteration, and to stop after 100 posts (say)?
The entire project is available here on GitHub. Even if I set POST_LIMIT=100 the infinite loop happens; see the command I executed:
scrapy crawl first -s POST_LIMIT=10 --output=output.json
To force the spider to close you can raise the CloseSpider exception, as described here in the Scrapy docs. Just be sure to return/yield your items before you raise the exception.
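For illustration, a minimal sketch of raising CloseSpider once a limit is hit (the spider name, the dict item, and the count attribute are placeholders for this sketch, not the asker's code):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ('http://www.9gag.com/', )
    count = 0  # stored on the instance, so it survives across callbacks

    def parse(self, response):
        for article in response.xpath('//article'):
            self.count += 1
            # yield the item first, then stop the whole crawl at the limit
            yield {'entry_id': article.xpath('@data-entry-id').extract()[0]}
            if self.count >= 100:
                raise CloseSpider('post limit reached')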
If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But Scrapy also has a feature to set download delays automatically, called AutoThrottle; it adjusts the delays based on the load of both the Scrapy server and the website you are crawling. Note that Scrapy sends "concurrent" requests instead of sending them one by one; in other words, the spider will send some number X of (simultaneous) requests to the web server at the same time.
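These are ordinary Scrapy settings; a sketch of the corresponding settings.py entries (the values are illustrative, not recommendations):

# settings.py
DOWNLOAD_DELAY = 1           # fixed one-second delay between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to observed load
CONCURRENT_REQUESTS = 16     # cap on simultaneous requests (Scrapy's default)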
There's a built-in setting CLOSESPIDER_PAGECOUNT that can be passed via the command-line -s argument or changed in settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
One small caveat is that if you've enabled caching, it will count cache hits as page counts as well.
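Equivalently it can be set in the project's settings.py, and since CLOSESPIDER_PAGECOUNT counts responses rather than posts, the related CLOSESPIDER_ITEMCOUNT setting may map more directly onto "100 posts" (a sketch, with illustrative values):

# settings.py
CLOSESPIDER_PAGECOUNT = 100  # close after 100 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 100  # or: close after 100 items have been scraped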
First: use self.count and initialize it outside of parse. Then don't prevent the parsing of the items, but the generating of new requests. See the following code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field

class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0  # lives on the instance, so it persists across callbacks

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # only request the next page while the limit has not been reached
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
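If the limit should come from the command line, as in the -s POST_LIMIT=100 attempt from the question, one sketch is to read the setting when the spider is created (POST_LIMIT is the question's setting name; from_crawler and settings.getint are standard Scrapy APIs). Adding this classmethod to the spider above:

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(FirstSpider, cls).from_crawler(crawler, *args, **kwargs)
        # -s POST_LIMIT=100 on the command line overrides the default of 30
        spider.COUNT_MAX = crawler.settings.getint('POST_LIMIT', spider.COUNT_MAX)
        return spider

then running

scrapy crawl first -s POST_LIMIT=100 --output=output.json

stops after roughly 100 posts.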