
How can I decode this utf-8 string, picked on a random website and saved by the Django ORM, using Python?

I parsed a file and saved its content in a database using Django. The website was 100% in English, so I naively assumed it would be ASCII all along, and saved the text happily as unicode.

You can guess the rest of the story :-)

When I print, I get the usual encoding error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)

A quick search tells me that u'\u2019' is the UTF-8 representation of ’.

repr(string) shows me this:

"u'his son\\u2019s friend'"

Then of course I tried django.utils.encoding.smart_str and a more direct approach using string.encode('utf-8'), and I ended up with something printable. Unfortunately, it prints like this in my (Linux UTF-8) terminal:

In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"

In [77]: print string.encode('utf-8')
his son�s friend

Not what I expected. I suspect I double encoded something or missed an important point.

Of course the file's original encoding is not published with the file. I guess I could read the HTTP headers or ask the webmaster, but since \u2019s looks like UTF-8, I assumed it was UTF-8. I could be very wrong; tell me if I am.

Solutions are obviously appreciated, but a deep explanation of the cause, and of what to do to keep this from happening again, would be appreciated even more. I often get bitten by encoding issues, which shows that I still haven't completely mastered the subject.

e-satis asked Jul 07 '11 05:07


1 Answer

You are fine. You have the proper data. Yes, the original data is UTF-8 (based on context, u2019 makes perfect sense as an apostrophe between "son" and "s"). The weird ? error character probably just means the font in your terminal configuration doesn't have a glyph for this character (a fancy apostrophe). No big deal. The data will be correct where it counts. If you are nervous, try some different terminal/OS combinations (I'm on OS X using iTerm). I spent a lot of time explaining to my QA guys that the scary ? question mark character just means they don't have a Chinese font installed on their Windows box (in my case we were testing with Chinese data). Here are some comments:

#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'

#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'

#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

#Encode in UTF-8 and now we have a byte string (a Python 2 str),
#which gets displayed with hex escapes for the non-ASCII bytes.
#Unicode code point 2019 looks like it gets 3 bytes in UTF-8 (yup, it does)
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'

#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
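For readers on Python 3, the same walk-through can be sketched roughly as follows. This is only an illustration, assuming Python 3 (the session above is Python 2), where `str` holds code points and `bytes` holds encoded data. The variable names `ascii_ok`, `utf8_bytes`, `round_trip` and `ascii_safe` are made up for this sketch.

```python
# Python 3 sketch of the session above (assumption: the original is Python 2).
s1 = "his son\u2019s friend"  # a str of abstract code points

# Encoding to ASCII fails for U+2019, just like the original error.
try:
    s1.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 yields bytes; U+2019 becomes the three bytes e2 80 99.
utf8_bytes = s1.encode("utf-8")

# Decoding those bytes as UTF-8 returns the original text unchanged.
round_trip = utf8_bytes.decode("utf-8")

# One possible fallback for consoles that cannot show the character:
# replace anything non-ASCII with a backslash escape instead of failing.
ascii_safe = s1.encode("ascii", errors="backslashreplace").decode("ascii")

# Only ASCII-safe values are printed, so this runs on any console.
print(ascii_ok, utf8_bytes, round_trip == s1, ascii_safe)
```

If the round trip succeeds, the stored data is intact and any odd symbol on screen is a display issue rather than a double encoding.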

See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html

See also for character 2019 (e28099 in hex, search for "2019" on this page): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000

See also: http://www.joelonsoftware.com/articles/Unicode.html
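The hex value quoted above for character 2019 can also be checked locally instead of looking it up in a table. A minimal sketch, again assuming Python 3 and using only the standard library (`ch`, `name` and `utf8_hex` are names chosen for this example):

```python
import unicodedata

ch = "\u2019"

# Official Unicode name of the code point.
name = unicodedata.name(ch)

# UTF-8 encoding of the code point, shown as a hex string.
utf8_hex = ch.encode("utf-8").hex()

print(name, utf8_hex)  # RIGHT SINGLE QUOTATION MARK e28099
```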

Peter Lyons answered Sep 23 '22 22:09