
Can't crawl with Scrapy at depth more than 1

Tags:

scrapy

I couldn't configure Scrapy to crawl with depth > 1. I have tried the three following options, none of them worked, and request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to crawl deeper than 1?

asked Aug 14 '12 by user555757

2 Answers

warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

So let's scrape mininova and see what happens. Starting at the today page, we see that there are two tor links:

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's fetch the first link. We see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])); a sketch of that flag follows this shell session:

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
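
As an aside, here is a minimal sketch of how that dont_filter flag could be used; the spider name, start URL choice, and callbacks are assumptions for illustration, not taken from the question:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class RefetchSpider(BaseSpider):
    # Hypothetical spider, only to illustrate dont_filter.
    name = 'refetch'
    start_urls = ['http://www.mininova.org/tor/13204738']

    def parse(self, response):
        # A Request for an already-seen URL is normally dropped by the
        # duplicate filter; dont_filter=True bypasses that check, so the
        # page's self-referencing /tor/ link is fetched a second time.
        yield Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.log("re-crawled %s" % response.url)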

No luck there; we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

Nope, this page also contains only one link, a link to itself, which gets filtered as well. Since there are actually no new links to scrape, Scrapy closes the spider (at depth == 1).
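
To double-check this, here is a minimal sketch (spider name assumed, not from the question) that follows every extracted /tor/ link and logs the depth that DepthMiddleware records in response.meta:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DepthCheckSpider(BaseSpider):
    # Hypothetical spider: follow every /tor/ link found and log its depth.
    name = 'depthcheck'
    start_urls = ['http://www.mininova.org/today']

    def parse(self, response):
        self.log("depth %s: %s" % (response.meta.get('depth', 0), response.url))
        # The only /tor/ links on the depth-1 pages point back to themselves,
        # so the requests below are dropped as duplicates and no response
        # deeper than 1 ever reaches this callback.
        for link in SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response):
            yield Request(link.url, callback=self.parse)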

answered by Steven Almeroth


I had a similar issue; it helped to set follow=True when defining the Rule:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
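
For example, a minimal sketch of a CrawlSpider rule with follow=True; the spider and callback names are assumptions for illustration:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowSpider(CrawlSpider):
    # Hypothetical names; the point here is the follow=True flag on the Rule.
    name = 'follow_example'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    rules = [
        # Because a callback is given, follow would default to False and the
        # crawl would stop at depth 1; follow=True keeps extracting links
        # from the crawled pages (up to DEPTH_LIMIT, if one is set).
        Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
             callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        self.log("crawled %s" % response.url)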

answered by Jakub M.