The following code works in Python 3:
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
And produces the following output:
Nicholas Gyeney, André
Writers: Nicholas Gyeney, André
In Python 2.7, though, I get the following error:
Traceback (most recent call last):
  File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 21: ordinal not in range(128)
I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:
b'Nicholas Gyeney, Andr\xc3\xa9'
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'
So I tried to use the following code:
import sys

if sys.version_info < (3, 0):
    reload(sys)
    sys.setdefaultencoding('utf-8')
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
That makes my code work in all versions of Python, but I read that using setdefaultencoding is discouraged.
What's the best approach to deal with this issue?
First, we assume that you want to support Python 2.7 and 3.5 (2.6 and 3.0 to 3.2 are handled a bit differently).
As you have already read, setdefaultencoding is discouraged, and it is actually not needed in your case.
To write code that handles Unicode text on both Python 2 and Python 3, you generally only need to specify the string encoding in a few places, such as the top of your script:
# -*- coding: utf-8 -*-
(only if you have string literals with Unicode text in your code)
Here is how I changed your example by following those rules:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
people = ['Nicholas Gyeney', 'André']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
print(type(writers))
print(len(writers))
In Python 2.7, the last two lines of the output are:
<type 'str'>
23
Here is what changed: I replaced \xe9 with the actual Unicode character (é) and removed the u prefixes.
It works fine in Python 2.7.12 and 3.5.2.
But be warned that removing the u prefixes makes Python 2 use the regular str type instead of unicode (see the output of print(type(writers))). With UTF-8 this works in most places as if it were a unicode string, but checking the text length returns a wrong value. In this example len returns 23, whereas the actual number of characters is 22. This is because the underlying type is str, which counts each byte as a character, and é is encoded as two bytes in UTF-8.
In other words, this works fine when outputting data (as in your example), but not if you want to do string manipulation on the text. In that case, you still need to use the u prefix or convert the data to the unicode type explicitly before string manipulation.
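The length difference and the explicit conversion can be reproduced in a way that runs on both Python 2.7 and Python 3. This is only a minimal sketch: it assumes the incoming data is UTF-8 encoded, and the names raw and text are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical names (raw, text); assumes the input bytes are UTF-8 encoded.

# Encoded bytes, comparable to an unprefixed Python 2 str literal.
raw = u'Nicholas Gyeney, Andr\xe9'.encode('utf-8')

# Decode explicitly before doing any string manipulation.
text = raw.decode('utf-8')

byte_count = len(raw)    # counts bytes; the é takes two of them in UTF-8
char_count = len(text)   # counts characters

print("%d bytes, %d characters" % (byte_count, char_count))
```

Slicing, upper-casing, and similar operations should be done on text rather than raw, for the same reason that len gives the expected answer only on the decoded value.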
So, outside of a simple example like yours, it is better to keep using the u prefix. You need it in two places:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
people = [u'Nicholas Gyeney', u'André']
writers = ", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))
print(type(writers))
print(len(writers))
In Python 2.7, the last two lines of the output are:
<type 'unicode'>
22
Note: the u prefix was removed in Python 3.0 and reintroduced in Python 3.3 for backward compatibility.
A detailed explanation of all the intricacies of working with Unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.
Here is an excerpt for the special comment specifying file encoding:
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports coding. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.
If you don’t include such a comment, the default encoding used will be ASCII.
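To see how the comment is picked up, the lookup can be tried out with the standard library. Note that this sketch is Python 3 only: tokenize.detect_encoding does not exist in Python 2, and in Python 3 the default without a comment is UTF-8 rather than the ASCII default described in the Python 2 excerpt above. The helper name declared_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 detects for these bytes.

    Hypothetical helper: it only wraps tokenize.detect_encoding, which
    inspects at most the first two lines of the source.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The "coding: name" form, on the second line after a shebang.
with_colon = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nu = u'abcd\xe9'\n"
# The "coding=name" form, without the Emacs-style -*- markers.
with_equals = b"# coding=utf-8\nx = 1\n"
# No encoding comment at all.
no_comment = b"x = 1\n"

for sample in (with_colon, with_equals, no_comment):
    print(declared_encoding(sample))
```

The tokenize module reports latin-1 under its normalized name iso-8859-1, so the printed name may differ in spelling from what the comment says.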
If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, Advanced Topics. It contains a detailed explanation of working with Unicode text in both generations of Python.
Another detail worth mentioning is that in Python 2, format always returns a str (byte string) if the format string is a str, even when the arguments are unicode. Any non-ASCII argument then has to be encoded with the ascii codec, which is what raised the UnicodeEncodeError in your original example.
Contrary to that, old-style formatting with % returns a unicode string if any of the arguments are unicode. So instead of writing this
print(u"Writers: {}".format(writers))
you could write this, which is not only shorter and prettier, but works in both Python 2 and 3:
print("Writers: %s" % writers)
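The two formatting styles can be compared side by side. This is a small sketch that runs on Python 2.7 and on Python 3.3 or later; text_type, percent_line, and format_line are illustrative names, not part of any library.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; text_type is unicode on Python 2 and str on Python 3.
text_type = type(u'')

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)

# %-formatting: on Python 2 the result is promoted to unicode because
# one of the arguments is unicode, even though the format string is a str.
percent_line = "Writers: %s" % writers

# str.format: on Python 2 the format string itself needs the u prefix,
# otherwise the non-ASCII argument triggers UnicodeEncodeError.
format_line = u"Writers: {}".format(writers)

print(percent_line)
print(percent_line == format_line)
```

Both lines end up as text strings with identical contents; the difference is only in which one needs the u prefix on the format string under Python 2.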