I'm having trouble with Scrapy. I need code that will scrape up to 1000 internal links per given URL. My code works when run at the command line, but the spider doesn't stop; it just logs the CloseSpider message and keeps crawling.
My code is as follows:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
# Note: the CloseSpider *exception* lives in scrapy.exceptions;
# scrapy.contrib.closespider contains the extension, not the exception.
from scrapy.exceptions import CloseSpider

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'testspider1'
    allowed_domains = ['angieslist.com']
    start_urls = ['http://www.angieslist.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url

        # How many items have been scraped so far, from the crawler stats
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        print scrape_count

        limit = 10
        if scrape_count == limit:
            raise CloseSpider('Limit Reached')
        return item
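If it matters, I start the crawl from inside the project directory with the usual command (assuming a standard Scrapy project layout):

scrapy crawl testspider1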
My problem was trying to apply CloseSpider in the wrong place. It's a setting that needs to go in the settings.py file. When I set it manually there, or passed it as an argument on the command line, it worked (stopping within 10-20 of N, for what it's worth, since requests already scheduled or in flight are still processed after the limit is hit).
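For the command-line route, any setting can be overridden with the -s option when starting the crawl (using the spider name from the question):

scrapy crawl testspider1 -s CLOSESPIDER_PAGECOUNT=1000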
settings.py:
BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 1000  # close the spider after 1000 responses have been crawled
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'yo mama'
LOG_LEVEL = 'DEBUG'
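The ITEM_PIPELINES entry points at a CsvWriterPipeline that isn't shown in the question. A minimal sketch of what such a pipeline might look like (the output filename and CSV layout here are assumptions, not from the original project):

# internal_links/pipelines.py (hypothetical reconstruction)
import csv

class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open('urls.csv', 'wb')  # 'wb' mode for the csv module on Python 2
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write one row per scraped item (just the url field here)
        self.writer.writerow([item['url']])
        return item

    def close_spider(self, spider):
        self.file.close()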