Update: this error can be reproduced simply by running this from the command line:
scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future
I'm using Scrapy to crawl a website. Every page I scrape claims to be encoded in UTF-8:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
But occasionally, the pages contain byte sequences that are not valid UTF-8, and I get Scrapy errors like:
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
I still need to scrape these pages, even though they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding, and use another (say, UTF-16) instead?
Here's where the exception is being caught:
2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
self._startRunCallbacks(result)
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
result = method(response=response, result=result, spider=spider)
There has been some work on encoding handling in the latest development version of Scrapy (0.15), so it could be worth trying the latest version first.
Scrapy lets you access unicode via response.body_as_unicode(). This handles encoding detection in a similar way to browsers, and you should nearly always use it instead of the raw body. As of Scrapy 0.15, it relies on w3lib.encoding.html_to_unicode, with a little customization.
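For reference, here is a minimal sketch of roughly what that call builds on, using w3lib directly (the header and body strings are made-up examples):

from w3lib.encoding import html_to_unicode

# html_to_unicode detects the encoding (from the Content-Type header,
# the page itself, or a default) and returns it with the decoded body
body = '<meta content="text/html; charset=utf-8" http-equiv="Content-Type">caf\xc3\xa9'
encoding, unicode_body = html_to_unicode("text/html; charset=utf-8", body)
print(encoding)      # detected encoding, e.g. 'utf-8'
print(unicode_body)  # the body decoded to unicode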
The decoding happens lazily, only when the unicode body is actually requested. In the spider, you can create a new response from the one you receive, specifying the encoding yourself; however, this shouldn't be necessary.
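If you do want to force it, here is a minimal sketch (assuming a Scrapy 0.15-era spider; the spider name, the URL, and the guess of latin-1 are all placeholders):

from scrapy.spider import BaseSpider

class FixedEncodingSpider(BaseSpider):
    name = "fixed_encoding"
    start_urls = ["http://www.example.com/page"]

    def parse(self, response):
        # replace() returns a copy of the response; passing encoding=
        # overrides whatever the headers or meta tag declared
        fixed = response.replace(encoding="latin-1")
        text = fixed.body_as_unicode()  # decodes with the forced encoding
        self.log(u"decoded %d characters" % len(text))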
It's not clear from the traceback which bit of code is actually raising the error; was there any more detail? Another possibility is that the body is getting truncated somehow.
If these pages are handled correctly by a browser but not by Scrapy, it would be appreciated if you could put together a simple test case and report a bug.
As you may get various character encodings on webpages, it is generally best to decode all your scraped data into unicode as soon as possible, deal with it as unicode inside the spider, and then encode it to whatever encoding you require at the last minute (just before you print it, write it to a database, etc.). I wrote a piece about this (based on my own experience with Scrapy) two days ago that may be helpful: http://www.harman-clarke.co.uk/answers/python-web-scraping-unicode.php
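A minimal sketch of that decode-early / encode-late pattern (the byte string and file name are illustrative):

raw = "Caf\xe8 menu"                   # raw bytes as scraped off the wire
text = raw.decode("utf-8", "replace")  # decode to unicode immediately;
                                       # 'replace' substitutes bad bytes
# ... all spider and pipeline logic works on `text` as unicode ...
out = open("output.txt", "w")
out.write(text.encode("utf-8"))        # encode only at the output boundary
out.close()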