
Scrapy Error - HTTP status code is not handled or not allowed

I am trying to run a spider but have this log:

2015-05-15 12:44:43+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-15 12:44:43+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 12:44:43+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-15 12:44:43+0100 [theverge] INFO: Spider opened
2015-05-15 12:44:43+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 12:44:43+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
        result = g.send(result)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 77, in start
        yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
        return signal.send_catch_log_deferred(*a, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
        *arguments, **named)
    --- <exception caught here> ---
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
        result = f(*args, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
        return receiver(*arguments, **named)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 47, in start_listening
        self.port = listen_tcp(self.portrange, self.host, self)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/reactor.py", line 14, in listen_tcp
        return reactor.listenTCP(x, factory, interface=host)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 495, in listenTCP
        p.startListening()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
        raise CannotListenError(self.interface, self.port, le)
    twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 48] Address already in use.

This first error, "[Errno 48] Address already in use", has started to appear in all my spiders, but the other spiders work anyway. Then comes:

2015-05-15 12:44:43+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6198
2015-05-15 12:44:44+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 12:44:44+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 12:44:44+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 12:44:44+0100 [theverge] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 191,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 265,
     'downloader/response_count': 1,
     'downloader/response_status_count/403': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 15, 11, 44, 44, 136026),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 5, 15, 11, 44, 43, 829689)}
2015-05-15 12:44:44+0100 [theverge] INFO: Spider closed (finished)
2015-05-15 12:44:44+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.stop_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
        result = g.send(result)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 300, in _finish_stopping_engine
        yield self.signals.send_catch_log_deferred(signal=signals.engine_stopped)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
        return signal.send_catch_log_deferred(*a, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
        *arguments, **named)
    --- <exception caught here> ---
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
        result = f(*args, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
        return receiver(*arguments, **named)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 53, in stop_listening
        self.port.stopListening()
    exceptions.AttributeError: TelnetConsole instance has no attribute 'port'

The error "exceptions.AttributeError: TelnetConsole instance has no attribute 'port'" is new to me. I do not know what is happening, since all my other spiders, which crawl other websites, work well.

Can anyone tell me how to fix this?

EDIT:

After a reboot these errors disappeared, but I still cannot crawl with this spider. Here are the logs now:

2015-05-15 15:46:55+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-15 15:46:55+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 15:46:55+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-15 15:46:55+0100 [theverge] INFO: Spider opened
2015-05-15 15:46:55+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-15 15:46:56+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 15:46:56+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 15:46:56+0100 [theverge] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 191,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 265,
     'downloader/response_count': 1,
     'downloader/response_status_count/403': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 5, 15, 14, 46, 56, 8769),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 5, 15, 14, 46, 55, 673723)}
2015-05-15 15:46:56+0100 [theverge] INFO: Spider closed (finished)

The line "2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed" is strange, since I am using DOWNLOAD_DELAY = 2 and last week I could crawl that website with no problems. What could be happening?

Inês Martins asked Apr 14 '26 09:04


1 Answer

"Address already in use" suggests that something else is listening on that port. Are you perhaps running another spider in parallel? The second error is just a consequence of the first: the telnet console never managed to open its port, so it cannot find the port when it later tries to close it.
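
To illustrate how the second error follows from the first, here is a minimal sketch using only the standard library. `ConsoleSketch` is a hypothetical stand-in, not Scrapy's actual TelnetConsole code; it only mirrors the pattern visible in the traceback, where `self.port` is assigned inside `start_listening` and used again inside `stop_listening`.

```python
import socket


class ConsoleSketch:
    """Hypothetical stand-in for an extension that opens a listening port."""

    def __init__(self, host, port_number):
        self.host = host
        self.port_number = port_number

    def start_listening(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # bind() raises OSError when the port is already taken.
            sock.bind((self.host, self.port_number))
            sock.listen(1)
        except OSError:
            sock.close()
            raise
        # Only reached when bind() succeeded.
        self.port = sock

    def stop_listening(self):
        # Raises AttributeError if start_listening() never assigned self.port.
        self.port.close()


def demo():
    # Occupy a port chosen by the OS, then try to listen on it a second time.
    blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    blocker.bind(("127.0.0.1", 0))
    blocker.listen(1)
    console = ConsoleSketch("127.0.0.1", blocker.getsockname()[1])

    failures = []
    try:
        console.start_listening()
    except OSError:
        failures.append("start_listening failed")
    try:
        console.stop_listening()
    except AttributeError:
        failures.append("stop_listening failed")

    blocker.close()
    return failures
```

Calling `demo()` reproduces both failures in order: the bind error at startup, then the missing attribute at shutdown.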

I would suggest rebooting to make sure no ports are still in use, then running only one spider to see if it works. If it happens again, you can investigate which application is using that port with netstat or a similar tool.
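
As a rough alternative to netstat, the check can also be sketched in Python with the standard `socket` module. The helper names are made up for illustration, and the default range 6023 to 6073 is an assumption taken from the two telnet console ports that appear in the logs above. Note that this only tells you whether a port accepts connections, not which process owns it.

```python
import socket


def port_in_use(host, port):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex() returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(host="127.0.0.1", first=6023, last=6073):
    """List the ports in [first, last] that currently accept connections."""
    return [p for p in range(first, last + 1) if port_in_use(host, p)]


if __name__ == "__main__":
    print(busy_ports())
```

If the list is long while no spider is supposed to be running, leftover crawl processes are a likely suspect.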

Update: HTTP error 403 Forbidden most likely means the site has banned you for making too many requests. To solve this, use a proxy server. Check out Scrapy's HttpProxyMiddleware.
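
The "HTTP status code is not handled or not allowed" message itself comes from HttpErrorMiddleware, which drops non-2xx responses before they reach the spider. If you want to look at the 403 body to see why the site refuses the request, a settings fragment along these lines should let it through. This is a sketch, assuming the project's settings.py already contains the values shown in the "Overridden settings" log line.

```python
# settings.py (fragment, sketch only)

# Already present in the project according to the log output.
BOT_NAME = "reviews"
DOWNLOAD_DELAY = 2

# Pass 403 responses on to spider callbacks instead of ignoring them,
# so the response body can be inspected.
HTTPERROR_ALLOWED_CODES = [403]
```

This does not lift the block; it only makes the refused response visible to your callback. For the proxy itself, the Scrapy documentation describes HttpProxyMiddleware as honoring the `http_proxy` and `https_proxy` environment variables, so exporting one of those before running the crawl is one way to route requests through a proxy.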

bosnjak answered Apr 17 '26 00:04


