
UTF-8 string in Python 2 and 3

The following code works in Python 3:

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

And produces the following output:

Nicholas Gyeney, André  
Writers: Nicholas Gyeney, André

In Python 2.7, though, I get the following error:

Traceback (most recent call last):
  File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 21: ordinal not in range(128)

I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:

b'Nicholas Gyeney, Andr\xc3\xa9'  
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'

So I tried to use the following code:

import sys

if sys.version_info < (3, 0):
    reload(sys)
    sys.setdefaultencoding('utf-8')

people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

Which makes my code work in all versions of Python. But I read that using setdefaultencoding is discouraged.

What's the best approach to deal with this issue?

B Faley asked Jan 09 '17 07:01


1 Answer

First, I assume you want to support Python 2.7 and 3.5 (2.6 and 3.0 to 3.2 are handled a bit differently).

As you have already read, setdefaultencoding is discouraged, and it is not needed in your case.

To write code that handles Unicode text on both Python 2 and 3, you generally only need to specify the string encoding in a few places:

  1. At the top of your script, below the shebang, with # -*- coding: utf-8 -*- (only if you have string literals with Unicode text in your code)
  2. When you read input data (e.g. from a text file or database)
  3. When you output data (again, to a text file or database)
  4. When you define a string literal in code
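Rules 2 and 3 could look roughly like the sketch below. It assumes io.open, which accepts an encoding argument on both Python 2.7 and 3; the temporary file name is made up for the illustration.

```python
import io
import os
import tempfile

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "writers.txt")

# Rule 3: state the encoding explicitly when writing text out.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Nicholas Gyeney, Andr\xe9\n")

# Rule 2: state the encoding explicitly when reading text in.
with io.open(path, "r", encoding="utf-8") as handle:
    writers = handle.read().strip()

# The value read back is a text string, so len counts characters.
print(len(writers))
```

Between those two boundaries the program only handles text strings, so no encode or decode calls are needed anywhere else.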

Here is how I changed your example by following those rules:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = ['Nicholas Gyeney', 'André']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))

print(type(writers))
print(len(writers))

The last two print calls output the following in Python 2.7:

<type 'str'>
23

Here is what changed:

  • Specified file encoding at top of file
  • Replaced \xe9 with the actual Unicode character (é)
  • Removed u prefixes

It works nicely in Python 2.7.12 and 3.5.2.

But be warned that removing the u prefixes makes Python 2 use the regular str type instead of unicode (see the output of print(type(writers))). With UTF-8 this behaves like a unicode string in most places, but checking the text length returns the wrong value. In this example len returns 23, whereas the actual number of characters is 22. This is because a Python 2 str is a byte string, so len counts bytes, and é is encoded in UTF-8 as two bytes. In Python 3 the same literal is already a text string, and len returns 22.
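The difference between the two counts can be reproduced on either Python version by encoding the text by hand. This is only a small sketch; the variable names are mine:

```python
text = u"Nicholas Gyeney, Andr\xe9"  # text string
encoded = text.encode("utf-8")        # byte string: \xe9 becomes \xc3\xa9

print(len(text))     # number of characters
print(len(encoded))  # number of bytes
```

The first count is 22 and the second is 23, because é takes two bytes in UTF-8.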

In other words, this works fine when outputting data (as in your example), but not if you want to do string manipulation on the text. In that case, you still need to use the u prefix or convert the data to the unicode type explicitly before manipulating it.
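One way to do the explicit conversion is a small helper like the one below. to_text is a hypothetical name, not a standard library function, and the sketch assumes the incoming bytes are UTF-8.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

raw = b"Andr\xc3\xa9"  # UTF-8 bytes, e.g. as read from a file opened in binary mode
name = to_text(raw)

print(len(raw))   # counts bytes
print(len(name))  # counts characters
print(name.upper() == u"ANDR\xc9")
```

After the conversion, operations such as len, slicing and upper work on characters rather than bytes.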

So for anything beyond your simple example, it is better to keep the u prefix. You need it in two places:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

people = [u'Nicholas Gyeney', u'André']
writers = ", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))

print(type(writers))
print(len(writers))

The last two print calls output the following in Python 2.7:

<type 'unicode'>
22

Note: the u prefix was removed in Python 3.0 and reintroduced in Python 3.3 for backward compatibility.

A detailed explanation of the intricacies of working with Unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.

Here is an excerpt about the special comment specifying the file encoding:

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])

The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports coding. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don’t include such a comment, the default encoding used will be ASCII.

If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, Advanced Topics. It contains a detailed explanation of working with Unicode text in both generations of Python.

Another detail worth mentioning is that, in Python 2, format always returns a byte string (str) if the format string is a byte string, even when the arguments are unicode. The arguments are encoded with the default ASCII codec, which is what raised the UnicodeEncodeError in your original code.

In contrast, old-style formatting with % returns a unicode string if any of the arguments is unicode. So instead of writing this:

print(u"Writers: {}".format(writers))

you could write this, which is not only shorter and prettier, but works in both Python 2 and 3:

print("Writers: %s" % writers)
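As a quick check of that claim, the sketch below formats with a plain literal on the left and verifies that the result is a text string on either version. The variable names are mine.

```python
writers = u", ".join([u"Nicholas Gyeney", u"Andr\xe9"])

# Plain literal on the left: on Python 2, % promotes the result to unicode
# because the argument is unicode; on Python 3 both are already text.
line = "Writers: %s" % writers

print(isinstance(line, type(u"")))
print(line == u"Writers: Nicholas Gyeney, Andr\xe9")
```

Both checks print True on Python 2.7 and on Python 3.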
quasoft answered Oct 01 '22 18:10