'ascii' codec can't encode character at position * ord not in range(128)

There are a few threads on Stack Overflow about this, but I couldn't find a solution that addresses the problem as a whole.

I have collected a large amount of text using the urllib read function and stored it in pickle files.

Now I want to write this data to a file. While writing, I get errors like this:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

and a lot of data is being lost.

I suppose the data returned by urllib's read is byte data.

I've tried

   1. text=text.decode('ascii','ignore')
   2. s=filter(lambda x: x in string.printable, s)
   3. text=u''+text
      text=text.decode().encode('utf-8')

but I still end up with similar errors. Can somebody point out a proper solution? Also, would stripping with the codecs module work? I don't mind if the offending bytes are not written to the file, so that loss is acceptable.

asked Mar 12 '13 14:03 by minocha


2 Answers

You can do this with smart_str from Django's django.utils.encoding module. Try this:

from django.utils.encoding import smart_str

text = u'\u2019'
# On Python 2, smart_str returns a byte string, UTF-8 encoded by default
print smart_str(text)

You can install Django by starting a command shell with administrator privileges and running this command:

pip install Django

answered Nov 15 '22 15:11 by Thanasis Petsas


Your data is unicode data. To write that to a file, use .encode():

text = text.encode('ascii', 'ignore')

but that would remove anything that isn't ASCII. Perhaps you wanted to encode to a more suitable encoding, like UTF-8, instead?
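
As a rough sketch of the difference between the two encodings, the snippet below encodes the same string both ways and then writes it to a file as UTF-8. The sample string and the temporary output path are made up for illustration, and io.open is used on the assumption that you would rather let the file object do the encoding than call .encode() yourself; it behaves the same way on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Sample text containing U+2019 (right single quotation mark),
# the character from the error message in the question.
text = u"It\u2019s a test"

# Lossy: every character outside ASCII is silently dropped.
ascii_bytes = text.encode("ascii", "ignore")

# Lossless: UTF-8 can represent every Unicode character.
utf8_bytes = text.encode("utf-8")

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# io.open encodes on write, so the unicode string is passed as-is.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
```

With 'ignore', the apostrophe simply disappears from the output, whereas the UTF-8 file keeps it, so the second approach is probably the better fit unless the consumer of the file really requires pure ASCII.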

You may want to read up on Python and Unicode:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

answered Nov 15 '22 15:11 by Martijn Pieters