
How do I define which spider the Scrapy shell uses?

I'm trying to test out some XPaths using the Scrapy shell, but it seems to be calling my incomplete spider module to do the scraping, which is not what I want. Is there a way to define which spider Scrapy uses with its shell? More to the point, why is Scrapy doing this; shouldn't it know the spider is not ready for use? That's why I'm using the shell, right? Otherwise I'd be using

scrapy crawl spider_name

if I wanted to use a specific spider.

Edit: After poring over the spider docs, I found the following description of the spider instance used in the shell:

spider - the Spider which is known to handle the URL, or a BaseSpider object if there is no spider found for the current URL

This means Scrapy has matched the URL to my spider and is using it instead of a BaseSpider. Unfortunately, my spider is not ready for testing, so is there a way to force the shell to use a BaseSpider instead?

emish, asked Jul 02 '11

1 Answer

Scrapy automatically selects the spider based on its allowed_domains attribute. If there is more than one spider for a given domain, Scrapy falls back to a BaseSpider.
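To illustrate the selection behavior described above, here is a minimal, Scrapy-free sketch (the spider names and domains are invented) of matching a URL's host against each spider's allowed_domains and falling back to a default when the match is missing or ambiguous, analogous to the shell falling back to BaseSpider:

```python
# Hypothetical sketch of allowed_domains-based spider selection.
# Not Scrapy's actual resolver -- just the idea behind it.
from urllib.parse import urlparse

SPIDERS = {  # hypothetical project spiders: name -> allowed_domains
    "myspider": ["example.com"],
    "othersite": ["other.org"],
}

def select_spider(url, spiders=SPIDERS, default="BaseSpider"):
    host = urlparse(url).hostname or ""
    # A spider matches if the host equals, or is a subdomain of, an allowed domain.
    matches = [name for name, domains in spiders.items()
               if any(host == d or host.endswith("." + d) for d in domains)]
    # Exactly one match: use that spider; otherwise fall back to the default.
    return matches[0] if len(matches) == 1 else default

print(select_spider("http://example.com/page"))   # -> myspider
print(select_spider("http://unknown.net/page"))   # -> BaseSpider
```

So a spider whose allowed_domains covers the URL you pass to the shell will be picked up automatically, which is exactly what the question is running into.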

But it's just a Python shell, so you can instantiate any spider you want:

>>> # Import and instantiate your spider manually inside the shell,
>>> # then call its callback on the shell's `response` object.
>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>> spider.parse_item(response)
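If the goal is only to try XPath expressions against some markup, you don't strictly need a spider at all. A stand-in sketch using the standard library (xml.etree.ElementTree supports only a small XPath subset, unlike Scrapy's selectors, and requires well-formed XML) might look like:

```python
# Quick XPath experimentation without any spider or Scrapy at all.
# ElementTree's XPath support is limited, but enough for structural checks.
import xml.etree.ElementTree as ET

html = "<html><body><div class='post'><a href='/item/1'>First</a></div></body></html>"
root = ET.fromstring(html)

# Find every link inside a div with class "post".
links = [a.get("href") for a in root.findall(".//div[@class='post']/a")]
print(links)  # -> ['/item/1']
```

For real pages you would still want Scrapy's selectors (or lxml), which tolerate broken HTML and support full XPath.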

Edit: as a workaround, you can set allowed_domains = [] on your spider so the shell does not select it.

R. Max, answered Oct 06 '22