I am using the web-scraping framework Scrapy to data mine some sites. I am trying to use CrawlSpider, and the pages have 'back' and 'next' buttons. The URLs are in the format
www.qwerty.com/###
where ### is a number that increments every time the 'next' button is pressed. How do I format the rules so that an endless loop doesn't occur?
Here is my rule:
rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)
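An endless loop shouldn't happen. By default, Scrapy filters out duplicate requests, so a URL that has already been crawled (for example, one reached again through the 'back' button) is not requested a second time.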
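To see why duplicate filtering is enough to stop the loop, here is a minimal standard-library sketch of the idea. It is not Scrapy's actual implementation (Scrapy fingerprints requests rather than keeping a plain set of page numbers), and the five-page site, `LAST_PAGE`, `extract_links` and `crawl` are all made-up names for illustration only.

```python
from collections import deque

# Hypothetical site: pages 1..LAST_PAGE, each linking to its neighbours
# through a 'back' and a 'next' button.
LAST_PAGE = 5


def extract_links(page):
    """Return the page numbers linked from `page` (back first, then next)."""
    links = []
    if page > 1:
        links.append(page - 1)  # 'back' button
    if page < LAST_PAGE:
        links.append(page + 1)  # 'next' button
    return links


def crawl(start):
    """Breadth-first crawl that never schedules the same page twice."""
    seen = {start}          # pages already scheduled (the duplicate filter)
    queue = deque([start])  # pages waiting to be fetched
    visited = []            # pages in the order they were fetched
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in extract_links(page):
            if link not in seen:  # skip anything already scheduled
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(3))  # prints [3, 2, 4, 1, 5]
```

Every page links back to one that has already been seen, but those links are dropped, so the queue eventually empties and the crawl ends after each page has been fetched exactly once.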