python/scrapy question: How to avoid endless loops

Question

I am using the web-scraping framework Scrapy to mine data from some sites. I am trying to use CrawlSpider, and the pages have 'back' and 'next' buttons. The URLs are in the format

www.qwerty.com/###

where ### is a number that increments every time the next button is pressed. How do I write the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)

user · Accepted Answer

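An endless loop shouldn't happen. Scrapy filters out duplicate requests by default, so when the 'back' link on a page points to a URL that has already been requested, that URL is not crawled again.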
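
To see why 'back' and 'next' links don't cause a loop once duplicates are filtered, here is a minimal stand-alone sketch of the idea. It does not use Scrapy at all: `fake_links` and `crawl` are hypothetical names made up for this illustration, pages are represented by their number, and the "seen" set is a simplification (Scrapy's own filter works on request fingerprints, not on plain page numbers).

```python
from collections import deque


def fake_links(page, last_page=5):
    """Hypothetical stand-in for link extraction.

    Every page links 'back' to page - 1 and 'next' to page + 1,
    except at the two ends of the site.
    """
    links = []
    if page > 1:
        links.append(page - 1)   # 'back' button
    if page < last_page:
        links.append(page + 1)   # 'next' button
    return links


def crawl(start_page, last_page=5):
    """Follow every link, but never schedule a page twice."""
    seen = {start_page}          # pages already scheduled
    queue = deque([start_page])  # pages waiting to be visited
    visited = []                 # order in which pages were processed

    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in fake_links(page, last_page):
            if link not in seen:  # the duplicate filter
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(3))
```

Each page is visited exactly once even though every page links back to its predecessor, so the crawl stops when no unseen pages remain. Without the `seen` check, pages 2 and 3 would keep scheduling each other forever.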

Tags: python, loops, scrapy, web-crawler
