I'm having a problem with Python's string.format() and passing Unicode strings to it. This is similar to this older question, except that in my case the test code explodes on the print, not on the logging.info() call. Passing the same Unicode string object to a logging handler works fine.
This fails with the older % formatting as well as with string.format(). Just to make sure that the string object was the problem, and not print interacting badly with my terminal, I tried assigning the formatted string to a variable before printing.
def unicode_test():
    byte_string = '\xc3\xb4'
    unicode_string = unicode(byte_string, "utf-8")
    print "unicode object type: {}".format(type(unicode_string))
    output_string = "printed unicode object: {}".format(unicode_string)
    print output_string

if __name__ == '__main__':
    unicode_test()
The string object seems to assume it's getting ASCII.
% python -V
Python 2.7.2
% python ./unicodetest.py
unicode object type: <type 'unicode'>
Traceback (most recent call last):
  File "./unicodetest.py", line 10, in <module>
    unicode_test()
  File "./unicodetest.py", line 6, in unicode_test
    output_string = "printed unicode object: {}".format(unicode_string)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 0: ordinal not in range(128)
Trying to cast output_string as Unicode doesn't make any difference.
output_string = u"printed unicode object: {}".format(unicode_string)
Am I missing something here? The documentation for the string object seems pretty clear that this should work as I'm attempting to use it.
In Python 3, the string type (str) uses the Unicode Standard for representing characters, which lets Python programs work with characters from many different scripts. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
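In Python 2, a normal string (str) is a sequence of 8-bit bytes with no encoding attached, while a unicode string is a sequence of Unicode code points (stored internally as 16-bit or 32-bit units, depending on how the interpreter was built). The unicode type can therefore represent a far more varied set of characters, including special characters from most languages in the world.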
To create a str in Python 2, you can use the str() built-in or string-literal syntax, like so: my_string = 'This is my string.'. To create an instance of unicode, you can use the unicode() built-in or prefix a string literal with a u, like so: my_unicode = u'This is my Unicode string.'.
Since Python 3.0, all strings are stored as Unicode in an instance of the str type. Encoded strings on the other hand are represented as binary data in the form of instances of the bytes type. Conceptually, str refers to text, whereas bytes refers to data.
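The text/data distinction can be sketched with the same two bytes used in the question. This is only an illustration, written so that it should run on both Python 2.7 and Python 3; the variable names are made up for the example.

```python
# The two UTF-8 bytes from the question, written as an explicit bytes literal.
raw = b'\xc3\xb4'

# Decoding turns encoded data into text
# (a unicode object in Python 2, a str object in Python 3).
text = raw.decode('utf-8')

# Encoding goes the other way, from text back to data.
round_trip = text.encode('utf-8')

assert len(raw) == 2        # two bytes of data
assert len(text) == 1       # one character of text, U+00F4
assert round_trip == raw    # nothing is lost on the way back
```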
No, this should not work (can you cite the part of the documentation that says so?). It does work if the formatting pattern is itself unicode, or with the old % formatting, which 'promotes' the pattern to unicode instead of trying to 'demote' the arguments to a byte string.
>>> x = "\xc3\xb4".decode('utf-8')
>>> x
u'\xf4'
>>> x + 'a'
u'\xf4a'
>>> 'a' + x
u'a\xf4'
>>> 'a %s' % x
u'a \xf4'
>>> 'a {}'.format(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 0: ordinal not in range(128)
>>> u'a {}'.format(x)
u'a \xf4'
>>> print u"Foo bar {}".format(x)
Foo bar ô
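If the pattern comes from somewhere you don't control, one possible workaround is to promote it by hand before formatting. The helper below is a hypothetical sketch, not part of any library: format_text is an invented name, and it assumes byte-string patterns are plain ASCII. It is written to run on both Python 2.7 and Python 3.

```python
def format_text(pattern, value):
    """Format `value` into `pattern`, promoting a byte-string pattern to text.

    Hypothetical helper; assumes a byte-string pattern contains only ASCII.
    """
    # In Python 2, `bytes` is an alias for `str`, so a plain 'a {}' literal
    # takes this branch and becomes a unicode pattern before formatting.
    if isinstance(pattern, bytes):
        pattern = pattern.decode('ascii')
    return pattern.format(value)


x = b'\xc3\xb4'.decode('utf-8')
result = format_text(b'a {}', x)
```

With a unicode pattern the call simply falls through to the normal format() method, so the helper can be used for both kinds of pattern.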
Edit: The final print line in the session above may not work for you if the unicode string can't be encoded using your console's encoding. For example, on my Windows console:
>>> import sys
>>> sys.stdout.encoding
'cp852'
>>> u'\xf4'.encode('cp852')
'\x93'
On a UNIX console this may be related to your locale settings. It will also fail if you redirect output (for example when using | in a shell). Most of these issues have been fixed in Python 3.
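One way to sidestep the console and redirection problems is to encode the text yourself before writing it out. The following is a rough sketch under stated assumptions: encode_for_stream and its fallback parameter are hypothetical names, and falling back to UTF-8 is just one reasonable choice. It relies on the fact that in Python 2, sys.stdout.encoding is None when output is piped or redirected.

```python
import sys


def encode_for_stream(text, stream=None, fallback='utf-8'):
    """Encode `text` for `stream`, substituting '?' for unencodable characters.

    Hypothetical helper; `fallback` is used when the stream reports no encoding.
    """
    if stream is None:
        stream = sys.stdout
    # The attribute may be missing or None (e.g. Python 2 with piped output).
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' avoids UnicodeEncodeError at the cost of losing the character.
    return text.encode(encoding, 'replace')
```

The result is a byte string, so in Python 2 it can be passed straight to print or sys.stdout.write(); in Python 3 it would have to go to a binary stream such as sys.stdout.buffer.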