Scrapy tutorial exceptions

Tags:

scrapy

I'm following the Scrapy tutorial documentation at http://media.readthedocs.org/pdf/scrapy/0.14/scrapy.pdf and I've verified that items.py and dmoz_spider.py are typed (not cut & pasted) correctly.

The first "hmmm..." part for me was this instruction:

This is the code for our first Spider; save it in a file named dmoz_spider.py under the dmoz/spiders directory

I'm using the latest version of Ubuntu and there wasn't a dmoz folder created, so I've put this code into ~/tutorial/tutorial/spiders. (Was this my first error?)

So here's my dmoz_spider.py script:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["dmoz.org"]
   start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

def parse(self, response):
   filename = response.url.split("/")[-2]
   open(filename, 'wb').write(response.body)

In my terminal I type

scrapy crawl dmoz

And I get this:

2012-10-08 13:20:22-0700 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: tutorial)
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled item pipelines: 
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider opened
2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
    raise NotImplementedError
exceptions.NotImplementedError: 

2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
    raise NotImplementedError
exceptions.NotImplementedError: 

2012-10-08 13:20:22-0700 [dmoz] INFO: Closing spider (finished)
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider closed (finished)

In my searching, I saw that someone else had said twisted probably wasn't installed... but wouldn't it be installed if I used the Ubuntu package installer for Scrapy?

Thanks in advance!

asked Oct 08 '12 20:10

1 Answer

BaseSpider's parse method is being called instead of yours because you haven't actually overridden it. Your indentation is wrong: def parse starts at the left margin, so it is declared as a module-level function outside the DmozSpider class. Welcome to Python :)

This has nothing to do with Twisted. Twisted shows up in the tracebacks, so it's clearly installed.
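
To see the effect of the indentation on its own, here is a minimal sketch that doesn't need Scrapy installed. The BaseSpider class below is a hypothetical stand-in I wrote for illustration, not Scrapy's real class; it only mimics the one behaviour visible in the traceback (parse raising NotImplementedError). BrokenSpider, FixedSpider and try_parse are likewise made-up names.

```python
# Stand-in for scrapy.spider.BaseSpider (an assumption for illustration only).
# Like the base class in the traceback, its parse() raises NotImplementedError.
class BaseSpider(object):
    def parse(self, response):
        raise NotImplementedError


class BrokenSpider(BaseSpider):
    name = "dmoz"

# Not indented under the class, so this is just a module-level function.
# BrokenSpider still inherits BaseSpider.parse.
def parse(self, response):
    return "parsed " + response


class FixedSpider(BaseSpider):
    name = "dmoz"

    # Indented to the same level as `name`, so it overrides BaseSpider.parse.
    def parse(self, response):
        return "parsed " + response


def try_parse(spider):
    """Call spider.parse and report whether the override was picked up."""
    try:
        return spider.parse("page")
    except NotImplementedError:
        return "NotImplementedError"


print(try_parse(BrokenSpider()))  # NotImplementedError
print(try_parse(FixedSpider()))   # parsed page
```

In your dmoz_spider.py the equivalent fix should be to indent the whole def parse block so that def lines up with name, allowed_domains and start_urls, with the two body lines one level deeper.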

answered Sep 23 '22 05:09

Shane Evans

user1729889
