I have a large project which runs fine with Python 2.7.9 on many devices.
But now the devices use Python 2.7.15, and in some cases it crashes when someone uses umlauts or eszett like äöüß.
In that case, a line like this raises an exception:
logger.info("device name {}".format(device_name))
I built a minimal test.py to reproduce the problem:
# -*- coding: utf-8 -*-
import locale
import os
import sys
print("#1 sys.stdout.encoding={}".format(sys.stdout.encoding))
print("#2 {}".format(locale.getdefaultlocale()))
u = u'aé ä ö ü ß'
print("#repr: " + repr(u.encode('utf-8')))
print("#3 type(u)={}".format(type(u)))
print(u.encode('utf-8', errors='ignore'))
print("#5 u={}".format(u))
With Python 2.7.9 it's fine:
#1 sys.stdout.encoding=ANSI_X3.4-1968
#2 (None, None)
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
#5 u=aé ä ö ü ß
This fails only with 2.7.15, output:
#1 sys.stdout.encoding=ANSI_X3.4-1968
#2 (None, None)
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
Traceback (most recent call last):
  File "utf8.py", line 16, in <module>
    print("#5 u={}".format(u))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Even when I used:
export PYTHONIOENCODING="UTF-8"
export LC_ALL=en_GB.utf8
export LANG=en_GB.utf8
This alters the output, but doesn't help:
#1 sys.stdout.encoding=UTF-8
#2 ('en_GB', 'UTF-8')
#repr: 'a\xc3\xa9 \xc3\xa4 \xc3\xb6 \xc3\xbc \xc3\x9f'
#3 type(u)=<type 'unicode'>
aé ä ö ü ß
Traceback (most recent call last):
  File "utf8.py", line 16, in <module>
    print("#5 u={}".format(u))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
I can fix this error with:
reload(sys)
sys.setdefaultencoding('utf8')
But this solution seems to be strongly discouraged, and I fear its side effects.
How can I fix this in a sane way?
Currently, updating to Python 3 isn't an option.
One of the biggest changes in Python 3 is the use of unicode strings by default.
If it is in your power to change the files where the problem occurs, you can improve the text handling of your program by backporting the unicode-by-default behaviour to your Python 2 code: add from __future__ import unicode_literals (I'd also suggest switching to the much nicer print-as-a-function with from __future__ import print_function).
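For instance, a minimal sketch of such a module header (the device_name value below is just a placeholder for this example) could look like this:
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# with unicode_literals, plain string literals are unicode objects under Python 2 as well
device_name = 'Drucker äöüß'
print("#type: {}".format(type(device_name)))  # <type 'unicode'> under Python 2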
In doing that, you will have to watch the places where your code outputs text back to the "outside world" - all print, log, database and file write calls: these may require byte strings. All you have to do is place a manual encode at these points:
logger.info("device name {}".format(device_name).encode("utf-8"))
(The print function, however, can handle unicode strings and will automatically use the guessed terminal encoding for its output.)
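As an illustration of where the manual encode goes (the file names and the logger setup here are invented for the example), the encode sits only at the calls that cross the boundary:
import io
import logging

logging.basicConfig(filename='devices.log', level=logging.INFO)  # hypothetical log file
logger = logging.getLogger(__name__)

device_name = u'Drucker äöüß'
logger.info(u"device name {}".format(device_name).encode("utf-8"))  # bytes go out to the log handler
with io.open('devices.txt', 'w', encoding='utf-8') as f:            # io.open does the encode on write
    f.write(u"device name {}\n".format(device_name))
print(u"device name {}".format(device_name))                        # print takes unicode directly, given a capable terminal encoding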
TL;DR: Always have all the text in your program as unicode objects. All of it, even string literals - and just decode from bytes, and encode back to bytes, at the interfaces of your system with external components (any I/O).
This may be termed the "unicode sandwich", and it can eliminate 97 of 100 encoding headaches. (You may have to spend some time finding which encoding you need - but you will know exactly where to place the decode (bytes-to-text): at any function getting data into your program; and the encode (text-to-bytes): at any function getting data out of your program.)
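A minimal sketch of that sandwich, assuming UTF-8 encoded input and output files (both file names are invented for the example):
# bread: decode on the way in (bytes -> unicode)
with open('names_in.txt', 'rb') as f:
    names = [line.decode('utf-8').strip() for line in f]

# filling: work only with unicode objects in between
report = u'\n'.join(u'device name {}'.format(n) for n in names)

# bread: encode on the way out (unicode -> bytes)
with open('report_out.txt', 'wb') as f:
    f.write(report.encode('utf-8'))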