I'm trying to write my first Scrapy spider. I've been following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html but I'm getting the error KeyError: 'Spider not found: juno'.
I think I'm running the command from the correct directory (the one with the scrapy.cfg file):
(proscraper)#( 10/14/14@ 2:06pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
tree
.
├── scrapy
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── juno_spider.py
└── scrapy.cfg
2 directories, 7 files
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
ls
scrapy scrapy.cfg
Here is the error I'm getting:
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
scrapy crawl juno
/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
Traceback (most recent call last):
  File "/home/tim/.virtualenvs/proscraper/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==0.24.4', 'console_scripts', 'scrapy')()
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 58, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/spidermanager.py", line 44, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: juno'
This is my virtualenv:
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
pip freeze
Scrapy==0.24.4
Twisted==14.0.2
cffi==0.8.6
cryptography==0.6
cssselect==0.9.1
ipdb==0.8
ipython==2.3.0
lxml==3.4.0
pyOpenSSL==0.14
pycparser==2.10
queuelib==1.2.2
six==1.8.0
w3lib==1.10.0
wsgiref==0.1.2
zope.interface==4.1.1
Here is the code for my spider with the name attribute filled in:
(proscraper)#( 10/14/14@ 2:14pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
cat scrapy/spiders/juno_spider.py
import scrapy

class JunoSpider(scrapy.Spider):
    name = "juno"
    allowed_domains = ["http://www.juno.co.uk/"]
    start_urls = [
        "http://www.juno.co.uk/dj-equipment/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
The scrapy.cfg file sits in the project root directory and names the project's settings module, for instance:

[settings]
default = [name of the project].settings

[deploy]
#url = http://localhost:6800/
project = [name of the project]
When you start a project with scrapy as the project name, it creates the directory structure you printed:
.
├── scrapy
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── juno_spider.py
└── scrapy.cfg
But using scrapy as the project name has a collateral effect. If you open the generated scrapy.cfg you will see that the default settings entry points to your scrapy.settings module:

[settings]
default = scrapy.settings
When we cat that scrapy/settings.py file we see:
BOT_NAME = 'scrapy'
SPIDER_MODULES = ['scrapy.spiders']
NEWSPIDER_MODULE = 'scrapy.spiders'
Well, nothing strange here: the bot name, the list of modules where Scrapy will look for spiders, and the module where new spiders will be created by the genspider command. So far, so good.
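To see why SPIDER_MODULES is the setting that matters for this error, here is roughly what Scrapy's spider manager does at startup. This is a simplified sketch with a made-up helper name, not the actual implementation (which lives in scrapy/spidermanager.py and also recurses into submodules):

from importlib import import_module

def find_spiders(spider_modules):
    # Build a name -> class mapping from every module listed in SPIDER_MODULES
    spiders = {}
    for modname in spider_modules:
        module = import_module(modname)
        for obj in vars(module).values():
            # A spider is, roughly, any class exposing a non-empty 'name' attribute
            if isinstance(obj, type) and getattr(obj, 'name', None):
                spiders[obj.name] = obj
    return spiders

scrapy crawl juno then looks up "juno" in that mapping; if the mapping is empty or the name is missing, Scrapy raises KeyError: 'Spider not found: juno'.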
Now let's check the scrapy library itself. It has been properly installed under your isolated proscraper virtualenv, in the /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy directory. Remember that site-packages is always added to sys.path, the list of paths Python searches for modules. So, guess what: the scrapy library also has a settings module, /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings, which imports /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings/default_settings.py, which holds the default values for all the settings. Pay special attention to the default SPIDER_MODULES entry:
SPIDER_MODULES = []
Maybe you are starting to see what is happening. Choosing scrapy as the project name also generated a scrapy.settings module that clashes with the scrapy library's own scrapy.settings. From there, the order in which the corresponding paths were inserted into sys.path determines which one Python imports: the first to appear wins. In this case the library's settings win, SPIDER_MODULES stays an empty list, no spiders are ever loaded, and hence the KeyError: 'Spider not found: juno'.
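If you want to see the clash with your own eyes, a quick diagnostic (assuming the layout above) is to ask Python which scrapy package it resolves:

import scrapy

# Whichever directory containing a 'scrapy' package appears first in sys.path
# wins; depending on how the interpreter was launched, this prints either your
# project package or the library under site-packages.
print(scrapy.__file__)

Two different packages answering to the same import name is exactly the ambiguity that breaks the crawl.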
To solve this conflict you could rename your project folder to another name, let's say scrap:
.
├── scrap
│ ├── __init__.py
Modify your scrapy.cfg to point to the proper settings module:
[settings]
default = scrap.settings
And update your scrap.settings to point to the proper spiders module:
SPIDER_MODULES = ['scrap.spiders']
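After the rename, running the crawl again from the directory containing scrapy.cfg should find your spider, because Python now imports your project's scrap.settings (no library module claims that name) and SPIDER_MODULES points at the package that actually contains it:

scrapy crawl juno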
But, as @paultrmbrth suggested, the simplest fix is to recreate the project with another name.