
Scrapy not crawling all the pages

Tags: python, scrapy

I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
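
For reference, the rest of the spider looks roughly like this (the parse_item body is just a placeholder):

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    class TestSpider(CrawlSpider):
        name = 'test_spider'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost/main_page.html']

        # Follow every link found and hand each crawled page to parse_item
        rules = (
            Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Placeholder callback; it just logs the crawled URL
            self.log('Crawled %s' % response.url)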

But the crawl results are as follows:

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

NB: I have set the crawl to breadth-first order (BFO), as indicated in the Scrapy docs.

What am I missing?

Asked by Siddharth

2 Answers

Scrapy will by default filter out all duplicate requests.

You can circumvent this by passing dont_filter, for example:

yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

See also the Request object documentation.
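
For illustration, here is a minimal sketch of a spider that re-follows already-seen links (the spider name and selector are placeholders; it targets the old 0.14-era API, so adjust to your version):

    from urlparse import urljoin

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = 'example_spider'
        start_urls = ['http://localhost/main_page.html']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Follow every link on the page, bypassing the duplicate-request
            # filter so pages reachable only through already-seen URLs are
            # still fetched. Beware: without a stopping condition this will
            # loop forever on a link cycle like the one described above.
            for href in hxs.select('//a/@href').extract():
                yield Request(urljoin(response.url, href),
                              callback=self.parse,
                              dont_filter=True)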

Answered by Sjaak Trekhaak



I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.

Try changing your user agent to that of a known browser and crawling again.

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.

More information about DOWNLOAD_DELAY is available at http://doc.scrapy.org/en/0.14/topics/settings.html
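
Both of these can be set in your project's settings.py, for example (the user agent string below is only illustrative):

    # settings.py

    # Present a regular browser user agent instead of the Scrapy default
    # (the string below is just an example)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'

    # Wait 2 seconds between consecutive requests to the same website
    DOWNLOAD_DELAY = 2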

Answered by CodeMonkeyB