Python 2.X: Why Can't I Properly Handle Unicode?

Tags: python, unicode, python-2.x, pyqt4

I have been experimenting with Python 2.x and Unicode for a while, but I've reached a point where things no longer make sense.

First problem:

Some code will explain what I mean. The txt variable simulates the PyQt4 translate function, which returns a QString.

# -*- coding: utf-8 -*-
from PyQt4 import QtCore
txt = QtCore.QString(u'può essere / sarà / 日本語')
txtUnicode1 = unicode(txt, errors='replace')
txtUnicode2 = unicode(txt)

When print()-ing the two unicode strings, I get:

pu� essere / sar� / ???

può essere / sarà / 日本語

Surely I could get the same thing by using QString.__str__(), but my point is understanding, so for the sake of argument I would like to know:

Why does errors='replace' replace all the non-ASCII characters when it is only supposed to do that in case of errors?
Is there a problem with just using unicode(QString) to turn the QString into a unicode string suitable for display?

Second problem:

I am trying to understand encoding/decoding.

>>> a = u'può essere / sarà / 日本'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
può essere / sarà / 日本

Does print decode a and b?
Is the UTF-8-encoded string supposed to be decoded entirely? Shouldn't I see the encoded string printed instead?
What is the difference between an encoded and a decoded unicode string?

asked Mar 08 '12 14:03

Aki

1 Answer

Let's fire up the old standby, IDLE, and see if we can replicate what you're seeing.

IDLE 1.1.4      
>>> a = u'può essere / sarà / 日本'

Unsupported characters in input
>>> a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
puÃ² essere / sarÃ  / æ—¥æœ¬

Note that I see something different when I print b. This is because my shell (IDLE) does not interpret a sequence of bytes as UTF-8 text, but rather uses my platform character encoding, cp1252.
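A minimal sketch of that misinterpretation, using only the standard codecs: decoding the UTF-8 bytes as cp1252 reproduces the garbled text above. The names utf8_bytes, mojibake and recovered are illustrative, and the snippet is written to run on both Python 2.6+ and Python 3.3+.

```python
# The text from the session above, written with escapes so the source stays ASCII.
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'

# Encoding turns the unicode text into a sequence of UTF-8 bytes.
utf8_bytes = a.encode('utf-8')

# A cp1252 console maps every single byte to one character, so each
# multi-byte UTF-8 sequence shows up as two or three unrelated characters.
mojibake = utf8_bytes.decode('cp1252')

# The damage is reversible as long as no byte was lost along the way.
recovered = mojibake.encode('cp1252').decode('utf-8')

print(mojibake != a)   # True: the garbled text differs from the original
print(recovered == a)  # True: re-encoding and decoding correctly restores it
```

Nothing is wrong with the bytes themselves; only the codec used to read them differs.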

Let's just double check this.

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Yup, that's why I get different behavior than you do: your sys.stdout.encoding is UTF-8. And that is why, despite a and b being completely different values (a is a unicode string, b is a byte string), they display the same; your terminal interprets the bytes as UTF-8.
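To make "completely different values" concrete, here is a small sketch that compares the two objects without relying on how any terminal renders them. The variable names are illustrative; the code runs on Python 2.6+ and Python 3.3+.

```python
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'  # text: a sequence of code points
b = a.encode('utf-8')                          # bytes: one serialization of that text

# len() counts code points for the text and bytes for the encoded form.
text_length = len(a)
byte_length = len(b)

# Decoding with the same codec that produced the bytes returns the original text.
round_trip = b.decode('utf-8')

print("%d code points, %d bytes" % (text_length, byte_length))  # 22 code points, 28 bytes
print(round_trip == a)                                          # True
```

The six extra bytes come from the four non-ASCII characters: two bytes each for the accented letters and three bytes each for the two kanji.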

So you might be wondering if we can convert our sequence of unicode characters a into a sequence of bytes that can be displayed in IDLE:

>>> c = a.encode('cp1252') 

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in -toplevel-
    c = a.encode('cp1252')
  File "C:\Python24\lib\encodings\cp1252.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>

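The answer is no; cp1252 has no byte representation for the Japanese characters 日本 at positions 20-21, so they cannot be encoded. (The accented letters ò and à are fine; cp1252 covers them.)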
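If some output is better than an exception, the codec's error handlers offer a few fallbacks. This goes slightly beyond the session above and is only a sketch: which handler is appropriate depends on what the bytes are for, and the variable names are illustrative. It runs on Python 2.6+ and Python 3.3+.

```python
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'

# Strict encoding (the default) raises, as in the traceback above.
try:
    a.encode('cp1252')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes '?' for every character cp1252 cannot represent.
replaced = a.encode('cp1252', 'replace')

# 'xmlcharrefreplace' keeps the information as numeric character references.
char_refs = a.encode('cp1252', 'xmlcharrefreplace')

# 'ignore' silently drops the unencodable characters.
ignored = a.encode('cp1252', 'ignore')

print(strict_failed)
print(repr(replaced))
print(repr(char_refs))
print(repr(ignored))
```

Only 'xmlcharrefreplace' preserves enough information to reconstruct the original text; 'replace' and 'ignore' are lossy.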

answered Sep 19 '22 02:09

ironchefpython
