I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper and it keeps on scraping, which I don't want. I want to increment a counter value and stop after 100 posts. But the 9gag page is designed so that each response gives only 10 posts, and after each iteration my counter value resets to 10; in this case my loop runs infinitely long and never stops.
# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count
The code for items.py is here:
from scrapy.item import Item, Field

class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increment a global count value, and tried this by passing 3 arguments to the parse function, but it gives an error:
TypeError: parse() takes exactly 3 arguments (2 given)
So is there a way to pass a global count value and return it after each iteration, and to stop after 100 posts (say)?
The entire project is available here on GitHub. Even if I set POST_LIMIT=100 the infinite loop happens; see the command I executed:
scrapy crawl first -s POST_LIMIT=10 --output=output.json
To force the spider to close you can raise the CloseSpider exception, as described here in the Scrapy docs. Just be sure to return/yield your items before you raise the exception.
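For illustration, a minimal sketch of raising CloseSpider once a limit is hit (the spider name, the dict item, and the count attribute are placeholders for this sketch, not the asker's code):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ('http://www.9gag.com/', )
    count = 0  # stored on the instance, so it survives across callbacks

    def parse(self, response):
        for article in response.xpath('//article'):
            self.count += 1
            # yield the item first, then stop the whole crawl at the limit
            yield {'entry_id': article.xpath('@data-entry-id').extract()[0]}
            if self.count >= 100:
                raise CloseSpider('post limit reached')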
If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it. But Scrapy also has a feature to set download delays automatically, called AutoThrottle; it adjusts the delays based on the load of both the Scrapy server and the website you are crawling. Note that Scrapy sends "concurrent" requests instead of sending them one by one; in other words, the spider will send some number X of (simultaneous) requests to the web server at the same time.
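These are ordinary Scrapy settings; a sketch of the corresponding settings.py entries (the values are illustrative, not recommendations):

# settings.py
DOWNLOAD_DELAY = 1           # fixed one-second delay between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to observed load
CONCURRENT_REQUESTS = 16     # cap on simultaneous requests (Scrapy's default)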
There's a built-in setting CLOSESPIDER_PAGECOUNT that can be passed via the command-line -s argument or changed in settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
One small caveat is that if you've enabled caching, it will count cache hits as page counts as well.
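Equivalently it can be set in the project's settings.py, and since CLOSESPIDER_PAGECOUNT counts responses rather than posts, the related CLOSESPIDER_ITEMCOUNT setting may map more directly onto "100 posts" (a sketch, with illustrative values):

# settings.py
CLOSESPIDER_PAGECOUNT = 100  # close after 100 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 100  # or: close after 100 items have been scraped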
First: use self.count and initialize it outside of parse. Then don't prevent the parsing of the items, but the generating of new requests. See the following code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field

class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0  # lives on the instance, so it persists across callbacks

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # only request the next page while the limit has not been reached
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
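If the limit should come from the command line, as in the -s POST_LIMIT=100 attempt from the question, one sketch is to read the setting when the spider is created (POST_LIMIT is the question's setting name; from_crawler and settings.getint are standard Scrapy APIs). Adding this classmethod to the spider above:

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(FirstSpider, cls).from_crawler(crawler, *args, **kwargs)
        # -s POST_LIMIT=100 on the command line overrides the default of 30
        spider.COUNT_MAX = crawler.settings.getint('POST_LIMIT', spider.COUNT_MAX)
        return spider

then running

scrapy crawl first -s POST_LIMIT=100 --output=output.json

stops after roughly 100 posts.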