I parsed a file and saved its content in a database using Django. The website was 100% in English, so I naively assumed it would be ASCII all along, and saved the text happily as unicode.
You can guess the rest of the story :-)
When I print, I get the usual encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)
A quick search tells me that u'\u2019' is the UTF-8 representation of ’.
repr(string) gives me this:
"u'his son\\u2019s friend'"
Then of course I tried django.utils.encoding.smart_str and a more direct approach using string.encode('utf-8'), and I ended up with something printable. Unfortunately, it prints like this in my (Linux, UTF-8) terminal:
In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"
In [77]: print string.encode('utf-8')
his son�s friend
Not what I expected. I suspect I double encoded something or missed an important point.
Of course, the file's original encoding is not published with the file. I guess I could read the HTTP headers or ask the webmaster, but since \u2019s looks like UTF-8, I assumed it was UTF-8. I could be very wrong; tell me if I am.
Solutions are obviously appreciated, but a deep explanation of the cause, and of what to do to keep this from happening again, would be appreciated even more. I often get bitten by encoding issues, which shows that I still haven't completely mastered the subject.
You are fine. You have the proper data. Yes, treating the original data as UTF-8 looks right: in context, U+2019 (a typographic apostrophe) makes perfect sense between "son" and "s". Strictly speaking, \u2019 is the Unicode code point, not UTF-8; its UTF-8 encoding is the three bytes e2 80 99, which is exactly what your encode('utf-8') call produced. The weird ? error character probably just means your terminal configuration's font doesn't have a glyph for this character (the fancy apostrophe). No big deal. The data will be correct where it counts. If you are nervous, try some different terminal/OS combinations (I'm on OS X using iTerm). I spent a lot of time explaining to my QA guys that the scary ? question mark character just means they don't have a Chinese font installed on their Windows box (in my case we were testing with Chinese data). Here are some comments:
#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'
#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'
#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
# Encode in UTF-8 and now we have a byte string (a Python 2 str),
# whose non-ASCII bytes get displayed as hex escapes.
#Unicode code point U+2019 takes 3 bytes in UTF-8: e2 80 99
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'
#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
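Since the question mentions a suspicion of double encoding, here is a minimal sketch of the difference between a code point, its UTF-8 bytes, and what real double encoding would look like. The variable names are made up for illustration, the choice of cp1252 as the "wrong" codec is an assumption, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# Unicode text: a sequence of abstract code points.
text = u"his son\u2019s friend"

# The apostrophe is one code point, U+2019. That is a number, not bytes.
code_point = ord(text[7])

# Encoding to UTF-8 turns that single code point into three bytes.
utf8_bytes = text.encode("utf-8")

# Decoding with the same codec gives back the original text,
# so nothing was lost or encoded twice.
round_trip = utf8_bytes.decode("utf-8")

# Real double encoding happens when UTF-8 bytes are decoded with the
# wrong codec (cp1252 here, as an assumed example) and then encoded again.
mojibake = utf8_bytes.decode("cp1252")
double_encoded = mojibake.encode("utf-8")
```

If the data had been double encoded, repr would show a longer byte sequence than \xe2\x80\x99 for the apostrophe, and the decoded text would contain three junk characters in its place. The output in the question shows exactly \xe2\x80\x99, which is the correct single encoding.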
See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
See also for character 2019 (e28099 in hex, search for "2019" on this page): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000
See also: http://www.joelonsoftware.com/articles/Unicode.html
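As for avoiding this in the future, one common habit (not something the original post spells out) is to decode bytes into Unicode once, as the data comes in, and encode once, as it goes out, always naming the encoding explicitly. The sketch below assumes the input is UTF-8; read_text and write_text are hypothetical helper names, and a temporary file stands in for the parsed file.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Decode bytes to Unicode text once, at the input boundary."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    """Encode Unicode text to bytes once, at the output boundary."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Demo with a temporary file standing in for the parsed file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_text(demo_path, u"his son\u2019s friend")
    loaded = read_text(demo_path)          # Unicode text
    with open(demo_path, "rb") as raw:
        raw_bytes = raw.read()             # the UTF-8 bytes on disk
finally:
    os.remove(demo_path)
```

Everything between those two boundaries works with Unicode objects only, so the implicit ASCII codec never gets a chance to raise UnicodeEncodeError.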