
How to fix an encoding error when migrating Python subprocess code to unicode_literals?

We're preparing to move to Python 3.4 and have added unicode_literals. Our code relies extensively on piping to/from external utilities using the subprocess module. The following code snippet works fine on Python 2.7 to pipe UTF-8 strings to a sub-process:

import subprocess

kw = {}
kw[u'stdin'] = subprocess.PIPE
kw[u'stdout'] = subprocess.PIPE
kw[u'stderr'] = subprocess.PIPE
kw[u'executable'] = u'/path/to/binary/utility'
args = [u'', u'-l', u'nl']

line = u'¡Basta Ya!'

popen = subprocess.Popen(args,**kw)
popen.stdin.write('%s\n' % line.encode(u'utf-8'))
...blah blah...

After adding unicode_literals, the same code throws this error:

from __future__ import unicode_literals

import subprocess

kw = {}
kw[u'stdin'] = subprocess.PIPE
kw[u'stdout'] = subprocess.PIPE
kw[u'stderr'] = subprocess.PIPE
kw[u'executable'] = u'/path/to/binary/utility'
args = [u'', u'-l', u'nl']

line = u'¡Basta Ya!'

popen = subprocess.Popen(args,**kw)
popen.stdin.write('%s\n' % line.encode(u'utf-8'))

Traceback (most recent call last):
  File "test.py", line 138, in <module>
    exitcode = main()
  File "test.py", line 57, in main
    popen.stdin.write('%s\n' % line.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Any suggestions to pass UTF-8 through the pipe?

tahoar asked Dec 31 '14 14:12



1 Answer

'%s\n' is a unicode string when you use unicode_literals:

>>> line = u'¡Basta Ya!'
>>> '%s\n' % line.encode(u'utf-8')
'\xc2\xa1Basta Ya!\n'
>>> u'%s\n' % line.encode(u'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

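What happens is that Python 2 implicitly decodes your encoded line value with the default ASCII codec so it can interpolate it into the unicode '%s\n' string, and that decode fails on the first non-ASCII byte (0xc2).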
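The implicit step can be reproduced explicitly. The following is a minimal sketch, assuming the conversion uses Python 2's default ASCII codec; it performs that decode by hand so the same failure appears on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

line = "¡Basta Ya!"
encoded = line.encode("utf-8")  # byte string whose first byte is 0xc2

# Python 2 does the equivalent of this decode implicitly when a byte string
# is interpolated into a unicode format string.
try:
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The printed message matches the one in the traceback from the question: the ASCII codec rejects byte 0xc2 at position 0.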

You'll have to use a byte string instead; prefix the string with b:

>>> from __future__ import unicode_literals
>>> line = u'¡Basta Ya!'
>>> b'%s\n' % line.encode(u'utf-8')
'\xc2\xa1Basta Ya!\n'

or encode after interpolation, which also works on Python 3.4 (bytes %-formatting only became available again in Python 3.5):

>>> line = u'¡Basta Ya!'
>>> ('%s\n' % line).encode(u'utf-8')
'\xc2\xa1Basta Ya!\n'

In Python 3, you'll have to write bytestrings to pipes anyway, unless you open the pipes in text mode (for example with universal_newlines=True).
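Putting the second option together, here is a sketch that should behave the same on Python 2.7 and Python 3. It is not the asker's actual setup: the child process is a stand-in (a second Python interpreter that echoes stdin back), and the names ECHO_CHILD and pipe_line are hypothetical, introduced only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import subprocess
import sys

# Stand-in for the external utility (an assumption, not the real binary):
# a child interpreter that copies the raw bytes on stdin to stdout.
ECHO_CHILD = (
    "import sys\n"
    "src = getattr(sys.stdin, 'buffer', sys.stdin)\n"
    "dst = getattr(sys.stdout, 'buffer', sys.stdout)\n"
    "dst.write(src.read())\n"
)


def pipe_line(line):
    """Send one unicode line through the child and return its decoded output."""
    popen = subprocess.Popen(
        [sys.executable, "-c", ECHO_CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Interpolate while everything is still unicode, then encode exactly once.
    payload = ("%s\n" % line).encode("utf-8")
    out, _err = popen.communicate(payload)
    return out.decode("utf-8")


result = pipe_line("¡Basta Ya!")
print(repr(result))
```

Keeping text as unicode inside the program and encoding only at the pipe boundary avoids any implicit ASCII conversion, whichever interpreter runs the code.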

Martijn Pieters answered Nov 01 '22 21:11
