Python's string.format() and Unicode

Tags: python, unicode

I'm having a problem with Python's string.format() and passing Unicode strings to it. This is similar to this older question, except that in my case the test code explodes on the print, not on the logging.info() call. Passing the same Unicode string object to a logging handler works fine.

This fails equally well with the older % formatting as well as string.format(). Just to make sure it was the string object that is the problem, and not print interacting badly with my terminal, I tried assigning the formatted string to a variable before printing.

def unicode_test():
    byte_string = '\xc3\xb4'
    unicode_string = unicode(byte_string, "utf-8")
    print "unicode object type: {}".format(type(unicode_string))
    output_string = "printed unicode object: {}".format(unicode_string)
    print output_string

if __name__ == '__main__':
    unicode_test()

The string object seems to assume it's getting ASCII.

% python -V
Python 2.7.2

% python ./unicodetest.py
unicode object type: <type 'unicode'>
Traceback (most recent call last):
  File "./unicodetest.py", line 10, in <module>
    unicode_test()
  File "./unicodetest.py", line 6, in unicode_test
    output_string = "printed unicode object: {}".format(unicode_string)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 0: ordinal not in range(128)

Trying to cast output_string as Unicode doesn't make any difference.

output_string = u"printed unicode object: {}".format(unicode_string)

Am I missing something here? The documentation for the string object seems pretty clear that this should work as I'm attempting to use it.

asked Dec 02 '12 22:12 by mpounsett



1 Answer

No, this should not work (can you cite the part of the documentation that says so?), but it should work if the formatting pattern is unicode (or with the old % formatting, which 'promotes' the pattern to unicode instead of trying to 'demote' the arguments).

>>> x = "\xc3\xb4".decode('utf-8')
>>> x
u'\xf4'
>>> x + 'a'
u'\xf4a'
>>> 'a' + x
u'a\xf4'
>>> 'a %s' % x
u'a \xf4'
>>> 'a {}'.format(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 0: ordinal not in range(128)
>>> u'a {}'.format(x)
u'a \xf4'
>>> print u"Foo bar {}".format(x)
Foo bar ô
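
The session above can be condensed into a minimal sketch. It assumes that the failing call is roughly equivalent to encoding the argument with the ASCII codec, which is what the traceback suggests; the variable names (text, ascii_ok, as_text, as_bytes) are made up for illustration. The explicit u'' and b'' prefixes let the same code run on Python 2.7 and on Python 3.3+.

```python
# The same character as in the question: U+00F4, decoded from its UTF-8 bytes.
text = b'\xc3\xb4'.decode('utf-8')

# Approximation of the implicit step Python 2 takes when a byte-string
# pattern has to absorb a unicode argument: encode it with the ASCII codec.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # U+00F4 is outside the ASCII range

# Option 1: keep everything as text by making the pattern unicode.
as_text = u'printed unicode object: {}'.format(text)

# Option 2: keep everything as bytes by encoding the argument explicitly.
as_bytes = b'printed unicode object: ' + text.encode('utf-8')
```

Option 1 is usually the simpler choice: work with unicode objects inside the program and encode only when the data leaves it.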

Edit: The print line may not work for you if the unicode string can't be encoded using your console's encoding. For example, on my Windows console:

>>> import sys
>>> sys.stdout.encoding
'cp852'
>>> u'\xf4'.encode('cp852')
'\x93'

On a UNIX console this may be related to your locale settings. It will also fail if you redirect output (for example, when piping with | in a shell), because Python 2 then has no console encoding to use and falls back to ASCII. Most of these issues have been fixed in Python 3.
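
One way to sidestep the console problem is to encode explicitly before writing. The helper below is a hypothetical sketch, not part of any library: encode_for_stream and its fallback parameter are names invented here. It assumes that UTF-8 is an acceptable default when the stream reports no encoding, and that replacing unencodable characters with '?' is preferable to raising an exception.

```python
import sys


def encode_for_stream(text, stream=None, fallback='utf-8'):
    """Encode text for a stream (hypothetical helper, not a library API).

    Uses the stream's reported encoding, or `fallback` when the stream
    reports none, as Python 2 does for redirected output.
    """
    if stream is None:
        stream = sys.stdout
    # A missing or None encoding attribute selects the fallback.
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the codec cannot represent.
    return text.encode(encoding, 'replace')
```

In Python 2 the resulting byte string can be passed straight to print or sys.stdout.write(). Setting the PYTHONIOENCODING environment variable is an alternative that needs no code change.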

answered Oct 04 '22 22:10 by lqc