 

Python Scrapy tutorial KeyError: 'Spider not found: juno'

Tags: python, scrapy

I'm trying to write my first Scrapy spider. I've been following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html, but I'm getting the error "KeyError: 'Spider not found: juno'".

I think I'm running the command from the correct directory (the one containing the scrapy.cfg file):

(proscraper)#( 10/14/14@ 2:06pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
   tree
.
├── scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── juno_spider.py
└── scrapy.cfg

2 directories, 7 files
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
   ls
scrapy  scrapy.cfg

Here is the error I'm getting

(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
   scrapy crawl juno
/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
Traceback (most recent call last):
  File "/home/tim/.virtualenvs/proscraper/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==0.24.4', 'console_scripts', 'scrapy')()
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 58, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/spidermanager.py", line 44, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: juno'

This is my virtualenv:

(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
   pip freeze
Scrapy==0.24.4
Twisted==14.0.2
cffi==0.8.6
cryptography==0.6
cssselect==0.9.1
ipdb==0.8
ipython==2.3.0
lxml==3.4.0
pyOpenSSL==0.14
pycparser==2.10
queuelib==1.2.2
six==1.8.0
w3lib==1.10.0
wsgiref==0.1.2
zope.interface==4.1.1

Here is the code for my spider with the name attribute filled in:

(proscraper)#( 10/14/14@ 2:14pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
   cat scrapy/spiders/juno_spider.py 
import scrapy

class JunoSpider(scrapy.Spider):
    name = "juno"
    allowed_domains = ["juno.co.uk"]
    start_urls = [
        "http://www.juno.co.uk/dj-equipment/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
asked Oct 14 '14 by Tim



1 Answer

When you start a project with scrapy as the project name, it creates the directory structure you printed:

.
├── scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── juno_spider.py
└── scrapy.cfg
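
That layout is what Scrapy's startproject command produces; presumably the project was created with something like:

scrapy startproject scrapy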

But using scrapy as the project name has a side effect. If you open the generated scrapy.cfg, you will see that the default settings entry points to your scrapy.settings module:

[settings]
default = scrapy.settings
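
As an aside, that [settings] entry is simply how the scrapy command picks a settings module; the same choice can be overridden per shell session with the SCRAPY_SETTINGS_MODULE environment variable:

export SCRAPY_SETTINGS_MODULE=scrapy.settings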

When we cat the generated scrapy/settings.py file we see:

BOT_NAME = 'scrapy'

SPIDER_MODULES = ['scrapy.spiders']
NEWSPIDER_MODULE = 'scrapy.spiders'

Well, nothing strange here: the bot name, the list of modules where Scrapy will look for spiders, and the module in which the genspider command will create new spiders. So far, so good.
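
For reference, genspider takes a spider name and a domain (both illustrative here) and writes the new spider file into the NEWSPIDER_MODULE package:

scrapy genspider mybot example.com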

Now let's check the scrapy library itself. It has been properly installed under your isolated proscraper virtualenv, in the /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy directory. Remember that site-packages is always added to sys.path, which holds all the paths Python searches for modules. So, guess what... the scrapy library also has a settings module, /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings, which imports /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings/default_settings.py, the file that holds the default values for all settings. Pay special attention to the default SPIDER_MODULES entry:

SPIDER_MODULES = []

Maybe you are starting to see what is happening. Choosing scrapy as the project name also generated a scrapy.settings module that clashes with the scrapy library's own scrapy.settings. From there, the order in which the corresponding paths were inserted into sys.path determines which one Python imports: the first to appear wins. In this case the library's settings wins, so SPIDER_MODULES stays empty, no spider modules are scanned, and your juno spider is never registered. Hence the KeyError: 'Spider not found: juno'.
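
You can see the clash for yourself. This small diagnostic (just a sketch; which file wins depends entirely on your sys.path order) prints the file Python actually resolved scrapy.settings to:

import sys
import scrapy.settings

# A site-packages path here means the library's module won; a path inside
# your project means your project's settings module won.
print(scrapy.settings.__file__)
# The import order is decided by this list: first match wins.
print(sys.path)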

To solve this conflict, you could rename your project package to another name, let's say scrap:

.
├── scrap
│   ├── __init__.py

Modify your scrapy.cfg to point to the proper settings module:

[settings]
default = scrap.settings

And update your scrap/settings.py to point to the proper spiders package; NEWSPIDER_MODULE needs the same rename:

SPIDER_MODULES = ['scrap.spiders']
NEWSPIDER_MODULE = 'scrap.spiders'
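
After these changes Scrapy should find the spider again; the list command should print its name:

scrapy list
juno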

That said, as @paultrmbrth suggested, I would simply recreate the project with another name.

answered Oct 17 '22 by dreyescat