 

Scrapy spider: dealing with pages that have incorrectly-defined character encoding

Update: this error can be reproduced simply by running this from the command line:

scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future

I'm using Scrapy to crawl a website. Every page I scrape claims to be encoded as UTF-8:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

But occasionally, the pages contain byte sequences that are not valid UTF-8, and I get Scrapy errors like:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte

I still need to scrape these pages, even though they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding, and use another (say, UTF-16) instead?

Here's where the exception is being caught:

2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
        result = method(response=response, result=result, spider=spider)
asked May 24 '12 by Misener

2 Answers

There has been some work on encoding in the latest dev scrapy (0.15). It could be worth trying the latest version.

Scrapy lets you access the decoded text via response.body_as_unicode(). This handles encoding detection in a similar way to browsers, and you should nearly always use it instead of the raw body. As of Scrapy 0.15, it relies on w3lib.encoding.html_to_unicode, with a little customization.

The decoding happens lazily, when someone requests unicode. You can create a new response, specifying the encoding yourself from the one you receive in the spider, however, this shouldn't be necessary.
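As a minimal stdlib sketch of that last option (overriding a wrongly declared encoding yourself; the sample bytes below are made up for illustration, and Scrapy's real detection lives in w3lib.encoding.html_to_unicode):

```python
# A page that declares UTF-8 but actually contains a stray Latin-1
# byte (0xe8, "è") -- exactly the situation in the question.
body = b'<meta charset="utf-8"><p>proc\xe8s</p>'

try:
    text = body.decode("utf-8")  # strict decode fails, like the spider does
except UnicodeDecodeError:
    # Option 1: decode leniently, mapping bad bytes to U+FFFD
    text = body.decode("utf-8", errors="replace")
    # Option 2: retry with the encoding you believe is the real one;
    # latin-1 maps every byte 1:1 to a code point, so it never fails
    recovered = body.decode("latin-1")

print(text)       # ...proc�s...
print(recovered)  # ...procès...
```

The same idea applies inside a spider: decode the raw bytes with the encoding (or error handler) you choose, rather than trusting the page's declaration.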

It's not clear from the traceback which bit of code is actually causing the error to happen. Was there any more detail? Another possibility could be that the body is getting truncated somehow.

If these pages are handled correctly by a browser and not by scrapy, then it would be appreciated if you could make a simple test case and report a bug.

answered Oct 23 '22 by Shane Evans


As you may get various character encodings on webpages, it is generally best to decode all your scraped data to unicode as soon as possible, work with it as unicode inside the spider, then encode it to whatever encoding you need at the last minute (before you print it, put it into a database, etc.). I actually wrote a piece about this (based on my own experience with Scrapy) two days ago that may be helpful: http://www.harman-clarke.co.uk/answers/python-web-scraping-unicode.php
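A short sketch of that decode-early, encode-late pattern (the function names here are illustrative, not part of Scrapy's API):

```python
def parse_title(raw_bytes, declared_encoding="utf-8"):
    # Decode as soon as the bytes arrive; replace undecodable bytes
    # instead of crashing on a mislabelled page.
    text = raw_bytes.decode(declared_encoding, errors="replace")
    return text.strip()

def export(text):
    # Encode only at the output boundary (file, database, network).
    return text.encode("utf-8")

title = parse_title(b" caf\xc3\xa9 \n")
print(title)          # café
print(export(title))  # b'caf\xc3\xa9'
```

Everything between those two boundaries then handles plain unicode strings, so a bad byte in one page can't blow up the rest of the pipeline.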

answered Oct 23 '22 by ahc