
python/scrapy question: How to avoid endless loops

I am using the web-scraping framework Scrapy to mine data from some sites. I am trying to use CrawlSpider, and the pages have 'back' and 'next' buttons. The URLs have the format

www.qwerty.com/###

where ### is a number that increments every time the 'next' button is pressed. How do I write the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)
asked May 23 '26 14:05 by ProgrammingAnt



1 Answer

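An endless loop shouldn't happen. By default, Scrapy filters out duplicate URLs: a request for a URL that has already been scheduled is dropped, so following 'back' from page N+1 does not fetch page N a second time. The crawl ends once every reachable page has been seen.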
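
To see why a 'back'/'next' pair cannot loop forever once duplicates are filtered, here is a minimal plain-Python sketch. It is a simulation of the idea only, not Scrapy's own code: the five-page site and the names LAST_PAGE, links_on and crawl are hypothetical, and the set named seen merely stands in for the duplicate filter.

```python
from collections import deque

# Hypothetical site: page n links to n - 1 ("back") and n + 1 ("next").
LAST_PAGE = 5


def links_on(page):
    """Return the page numbers that a page links to (back and next)."""
    return [p for p in (page - 1, page + 1) if 1 <= p <= LAST_PAGE]


def crawl(start):
    """Breadth-first crawl that never schedules the same page twice."""
    seen = {start}            # stands in for the duplicate filter
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in links_on(page):
            if link not in seen:   # an already-seen link is dropped here
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(3)
print(order)
```

Every page is visited exactly once even though each page links back to its neighbour, and the loop stops when the queue runs dry. Note that in Scrapy a request created with dont_filter=True bypasses the duplicate filter, so that option is worth avoiding on pagination links.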

answered May 25 '26 09:05 by user



