I simply can't figure this out! :( I am scraping data from a UTF-8 encoded site, or at least that is what it claims:
Content-Type: text/html;charset=utf-8
I am getting a list of regular unicode strings from an XPath selector extract() call:
item['city']= element.select('//div[@id="bubble_2"]/div/text()').extract()
This is the list:
[u'Westbahnhofstr.\xa010', u'72070\xa0T\xfcbingen']
Now I join the list into one unicode string:
item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract())
So far so good:
u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'
The problem appears when I try to output this unicode string, either to the screen (print) or to a file (write). Whatever I try, it returns an error (http://pastebin.com/51DkX2R2):
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 11: ordinal not in range(128)
I have encoded unicode to byte string before output of course:
item['city'].encode('utf-8')
This is my pipeline.py, showing how I open and write to my CSV:
import csv
import items
import urlparse
import codecs

class DepostPipeline(object):
    def __init__(self):
        self.modelsCsv = csv.writer(codecs.open('Dees.csv', mode='w',encoding='utf-8'))
        self.modelsCsv.writerow(['city'])

    def process_item(self, item, spider):
        if isinstance(item, items.DetailsItem):
            item['city'] = item['city'].encode('utf-8')
            self.modelsCsv.writerow([item['city']])
        return item
The weirdest thing is that my system (Python on Windows) handles unicode strings perfectly:
C:\Console2>python
Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s=u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'
>>> print s
Beim Nonnenhaus 672070 Tübingen
I have been reading a lot about UTF-8, Unicode, encoding and decoding over the last 10 days, but it seems I am still missing something here. I appreciate any help or advice.
You are ignoring the result of the .encode() call:
item['city'].encode('utf-8')
Strings are immutable, so they are not encoded in place. What's more, the returned object has a different type: a byte string (str) rather than unicode. You'll need to assign the return value back:
item['city'] = item['city'].encode('utf-8')
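To see why the assignment matters, here is a small sketch using the string from the question. The variable names are illustrative only, and it is written to run on Python 2.7 as well as Python 3.3 or later; keep in mind that the byte string type is called str on Python 2 and bytes on Python 3.

```python
city = u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'
original_type = type(city)

# Calling .encode() without keeping the result changes nothing:
# the original unicode string is left exactly as it was.
city.encode('utf-8')
unchanged = type(city) is original_type

# Keeping the return value gives you the UTF-8 byte string.
encoded = city.encode('utf-8')
print(repr(encoded))
```

After the last line, encoded holds bytes while city is still the original unicode object, which is why the result has to be stored somewhere.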
However, you should not use codecs.open() for the CSV file. The Python 2 csv module will always write bytestrings, not Unicode. A codecs.open() file object expects Unicode, so when it is handed a bytestring an implicit decode with the default ASCII codec takes place to get back to Unicode, and that decode is what fails for you; it is why you get a UnicodeDecodeError exception, not an encode error:
File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 11: ordinal not in range(128)
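The implicit step can be reproduced by hand. The sketch below performs the ASCII decode explicitly on the UTF-8 bytes of the string from the question; it illustrates the mechanism only and is not the actual code path inside codecs.py. Because the decode is explicit, it should behave the same on Python 2 and Python 3.

```python
# UTF-8 bytes, as produced by item['city'].encode('utf-8')
encoded = u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'.encode('utf-8')

try:
    # Python 2 does this behind the scenes when a byte string is
    # passed to something that expects unicode.
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The byte 0xc2 is the first byte of the UTF-8 encoding of the non-breaking space u'\xa0', which is why the decode stops there; the reported position depends on the string being written.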
Use a regular open() call instead:
self.modelsCsv = csv.writer(open('Dees.csv', mode='wb'))
Note the 'wb'; the csv module handles line endings itself.
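Putting the pieces together, a minimal sketch of the writing step might look like the following. The helper name write_cities and the temporary file are assumptions made for the example, not part of the original pipeline. The Python 2 branch follows the advice above (binary file, values encoded to UTF-8 by hand); the Python 3 branch is included only so the sketch runs on both versions, since there the csv module works with text and the file object does the encoding.

```python
import csv
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def write_cities(path, cities):
    """Write unicode city strings to a one-column UTF-8 CSV (hypothetical helper)."""
    if PY2:
        # Python 2: binary file, and every value encoded to bytes first.
        f = open(path, 'wb')
        prepare = lambda text: text.encode('utf-8')
    else:
        # Python 3: text file; newline='' leaves line endings to csv.
        f = open(path, 'w', newline='', encoding='utf-8')
        prepare = lambda text: text
    try:
        writer = csv.writer(f)
        writer.writerow(['city'])
        for city in cities:
            writer.writerow([prepare(city)])
    finally:
        f.close()


# Usage: write one row to a temporary file and read the raw bytes back.
fd, demo_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
write_cities(demo_path, [u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'])
with open(demo_path, 'rb') as f:
    raw = f.read()
os.remove(demo_path)
print(repr(raw))
```

In a Scrapy pipeline the same idea would mean opening the file once in __init__ and encoding each field in process_item just before writerow().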