Python 2.X: Why Can't I Properly Handle Unicode?

I have been experimenting for a while with Python 2.X and unicode. But I've reached a point where it doesn't make sense.

First problem:

Some code will explain what I mean. The txt variable simulates the PyQt4 translate function, which returns a QString.

# -*- coding: utf-8 -*-
from PyQt4 import QtCore
txt = QtCore.QString(u'può essere / sarà / 日本語')
txtUnicode1 = unicode(txt, errors='replace')
txtUnicode2 = unicode(txt)

When print()-ing the two unicode strings, I get:

pu� essere / sar� / ???

può essere / sarà / 日本語

Surely I could get the same thing by using QString.__str__(), but my point is understanding, so for the sake of argument I would like to know:

  • Why does errors='replace' replace all the non-ASCII characters when it is only supposed to do that in case of errors?
  • Is there a problem with just using unicode(QString) to make the QString into a unicode string suitable for displaying?

Second problem:

I am trying to understand encoding/decoding.

>>> a = u'può essere / sarà / 日本'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
può essere / sarà / 日本
  • Does print decode a and b?
  • Is the UTF-8 encoded string supposed to be decoded entirely when printed? Shouldn't I see the encoded bytes printed instead?
  • What is the difference between an encoded and a decoded unicode string?
asked Mar 08 '12 14:03 by Aki



1 Answer

Let's fire up the old standby, IDLE, and see if we can replicate what you're seeing.

IDLE 1.1.4      
>>> a = u'può essere / sarà / 日本'

Unsupported characters in input
>>> a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
puÃ² essere / sarÃ  / æ—¥æœ¬

Note that I see something different when I print b. This is because my shell (IDLE) does not interpret a sequence of bytes as UTF-8 text, but rather uses my platform character encoding, cp1252.
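The effect can be reproduced without a cp1252 shell by decoding the UTF-8 bytes by hand. This is only a sketch of what the two kinds of terminal do with b; the names as_utf8 and as_cp1252 are mine, and it is written with escapes so it runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
b = a.encode('utf-8')            # the UTF-8 byte sequence

# A UTF-8 terminal groups the bytes back into the original characters.
as_utf8 = b.decode('utf-8')

# A cp1252 shell maps every single byte to one character instead,
# which displays as: puÃ² essere / sarÃ  / æ—¥æœ¬
as_cp1252 = b.decode('cp1252')

print(as_utf8 == a)                      # True
print(len(as_utf8), len(as_cp1252))      # 22 28
```

The bytes never change; only the interpretation applied by whatever displays them does.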

Let's just double check this.

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Yup, that's why I get different behavior than you do: your sys.stdout.encoding is UTF-8. And that is why, despite a and b being completely different values, they display the same. print encodes the unicode string a to UTF-8 for output, the bytes in b are written as they are, and your terminal interprets both as UTF-8.
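To make "completely different values" concrete, here is a small sketch comparing the two objects directly. It assumes nothing beyond the a and b from the session above and is written to run under both Python 2.7 and Python 3 (where the type names differ, so they are printed rather than stated).

```python
from __future__ import print_function
import sys

a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'   # text: a sequence of code points
b = a.encode('utf-8')                            # bytes: one serialization of that text

print(type(a).__name__, len(a))    # length is 22 code points
print(type(b).__name__, len(b))    # length is 28 bytes
print(b.decode('utf-8') == a)      # True: decoding recovers the text

# The encoding that print uses when it is handed a unicode object.
print(sys.stdout.encoding)
```

So "decoded" means you hold the characters themselves, and "encoded" means you hold bytes that only become characters once someone picks an encoding to read them with.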

So you might be wondering if we can convert our sequence of unicode characters a into a sequence of bytes that can be displayed in IDLE:

>>> c = a.encode('cp1252') 

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in -toplevel-
    c = a.encode('cp1252') #uses default encoding
  File "C:\Python24\lib\encodings\cp1252.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>

The answer is no; cp1252 has no mapping for CJK characters such as 日本, so they cannot be encoded as cp1252 bytes.
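If a lossy result is acceptable, the errors argument of encode() controls what happens to the characters the codec cannot represent. A minimal sketch, assuming the same a as above; failed_span and lossy are names I made up for illustration, and the code runs under both Python 2.7 and Python 3.

```python
from __future__ import print_function

a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'

# Strict encoding (the default) raises, and the exception records
# which slice of the input could not be encoded.
try:
    a.encode('cp1252')
    failed_span = None
except UnicodeEncodeError as exc:
    failed_span = (exc.start, exc.end)   # half-open range of offending characters

# errors='replace' substitutes '?' for each unencodable character
# instead of raising.
lossy = a.encode('cp1252', 'replace')

print(failed_span)    # (20, 22)
print(repr(lossy))
```

Here the accented Latin letters survive because cp1252 contains them, while the two kanji are each replaced by a question mark.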

answered Sep 19 '22 02:09 by ironchefpython