
Can't crawl with Scrapy at depth more than 1

Tags:

scrapy

I couldn't configure Scrapy to crawl with depth > 1. I have tried the three following options, none of them worked, and request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to crawl deeper than 1?

asked Aug 14 '12 by user555757

2 Answers

warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

So let's scrape mininova and see what happens. Starting at the today page, we see that there are two tor links:

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's fetch the first link. We see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])); a sketch of that flag follows this shell session:

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
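
As an aside, here is a minimal sketch of how that dont_filter flag could be used; the spider name, start URL choice, and callbacks are assumptions for illustration, not taken from the question:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class RefetchSpider(BaseSpider):
    # Hypothetical spider, only to illustrate dont_filter.
    name = 'refetch'
    start_urls = ['http://www.mininova.org/tor/13204738']

    def parse(self, response):
        # A Request for an already-seen URL is normally dropped by the
        # duplicate filter; dont_filter=True bypasses that check, so the
        # page's self-referencing /tor/ link is fetched a second time.
        yield Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.log("re-crawled %s" % response.url)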

No luck there; we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

Nope, this page also contains only one link, a link to itself, which gets filtered as well. Since there are actually no new links to scrape, Scrapy closes the spider (at depth == 1).
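
To double-check this, here is a minimal sketch (spider name assumed, not from the question) that follows every extracted /tor/ link and logs the depth that DepthMiddleware records in response.meta:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DepthCheckSpider(BaseSpider):
    # Hypothetical spider: follow every /tor/ link found and log its depth.
    name = 'depthcheck'
    start_urls = ['http://www.mininova.org/today']

    def parse(self, response):
        self.log("depth %s: %s" % (response.meta.get('depth', 0), response.url))
        # The only /tor/ links on the depth-1 pages point back to themselves,
        # so the requests below are dropped as duplicates and no response
        # deeper than 1 ever reaches this callback.
        for link in SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response):
            yield Request(link.url, callback=self.parse)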

answered by Steven Almeroth


I had a similar issue; it helped to set follow=True when defining the Rule:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
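
For example, a minimal sketch of a CrawlSpider rule with follow=True; the spider and callback names are assumptions for illustration:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowSpider(CrawlSpider):
    # Hypothetical names; the point here is the follow=True flag on the Rule.
    name = 'follow_example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    rules = [
        # Because a callback is given, follow would default to False and the
        # crawl would stop at depth 1; follow=True keeps extracting links
        # from the crawled pages (up to DEPTH_LIMIT, if one is set).
        Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
             callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        self.log("crawled %s" % response.url)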

answered by Jakub M.