I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows:
main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html
I am using the following rule in CrawlSpider -
Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
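For context, the full spider looks roughly like this (the spider name, allowed domain, and parse_item body are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    # Follow every link on every page and hand each response to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder callback: just log which page was reached
        self.log('Parsed %s' % response.url)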
But the crawl results are as follows -
DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)
It is not crawling all the pages.
NB: I have set the crawl to BFO (breadth-first order), as indicated in the Scrapy docs.
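A sketch of the settings that recent Scrapy docs recommend for breadth-first order (the exact setting module paths may differ in older releases such as 0.14):

# settings.py - crawl breadth-first instead of the default depth-first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'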
What am I missing?
Scrapy will by default filter out all duplicate requests.
You can circumvent this by using (example):
yield Request(url="test.com", callback=self.callback, dont_filter = True)
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
Also see the Request object documentation
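A minimal sketch of how dont_filter fits into a spider, written against the 0.14-era API referenced above (the spider name and XPath link extraction are illustrative; bypassing the filter can make a crawl loop forever on a link cycle like the one in the question):

from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class FollowAllSpider(BaseSpider):
    name = 'follow_all'
    start_urls = ['http://localhost/main_page.html']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # dont_filter=True lets already-seen URLs be scheduled again,
            # so add your own stopping condition to avoid endless loops.
            yield Request(urljoin(response.url, href),
                          callback=self.parse,
                          dont_filter=True)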
I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.
Try changing your user agent to that of a known browser and try again.
Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.
More information about DOWNLOAD_DELAY at http://doc.scrapy.org/en/0.14/topics/settings.html
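A minimal settings.py sketch combining both suggestions (the user agent string is just an example of a browser-like value):

# settings.py (sketch)
# Identify as a regular browser instead of the default Scrapy bot string
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'

# Wait 2 seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2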