
Concatenating Unicode with string: print '£' + '1' works, but print '£' + u'1' throws UnicodeDecodeError

I've observed the following:

>>> print '£' + '1'
£1
>>> print '£' + u'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print u'£' + u'1'
£1
>>> print u'£' + '1'
£1

Why does '£' + '1' work but '£' + u'1' doesn't work?

I looked at the types:

>>> type('£' + '1')
<type 'str'>
>>> type('£' + u'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> type(u'£' + u'1')
<type 'unicode'>

This also confuses me. If '£' + '1' is a str and not a unicode, why does it print properly on my terminal? Shouldn't it print something like '\xc2\xa31'?

To add to the mix, I've also observed the following:

>>> u'£' + '1'
u'\xa31'
>>> type('1')
<type 'str'>
>>> type(u'£')
<type 'unicode'>
>>> print u'£' + '1'
£1

Why does u'£' + '1' not print out the £ symbol properly, whereas print u'£' + '1' does? Is it because repr is used in the former, whereas str is used in the latter?

Also, how come concatenation of a unicode and a str work in this case, but not in the '£' + u'1' case?

texasflood asked Aug 02 '15 12:08



1 Answer

You are mixing object types.

'£' is a bytestring containing encoded data. That those bytes happen to represent a pound sign in your terminal or console is neither here nor there; they could just as well have been a pixel in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual content of that bytestring is the two bytes C2 and A3, when expressed in hexadecimal.

u'1', on the other hand, is a Unicode string. It is unambiguously text data. If you want to concatenate other data to it, that data should be Unicode too. When you mix the two types, Python 2 automatically tries to decode the str bytes to Unicode using the default ASCII codec. That is why u'£' + '1' works: '1' is plain ASCII and decodes cleanly.

However, the '£' bytestring is not decodable as ASCII. It can be decoded as UTF-8; decode the bytes explicitly, since we know the correct codec here:

print '£'.decode('utf8') + u'1'
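
As a rough sketch of the two paths described above, the snippet below spells out the ASCII decode that Python 2 attempts implicitly and compares it with the explicit UTF-8 decode. It uses explicit b'...' and u'...' literals so that it should behave the same on Python 2.7 and Python 3; implicit_style_concat is a hypothetical helper name, not something Python provides.

```python
# The two UTF-8 bytes a UTF-8 terminal sends for the pound sign.
pound_bytes = b'\xc2\xa3'


def implicit_style_concat(byte_string, text):
    """Hypothetical helper: mimic Python 2's implicit ASCII decode before concatenating."""
    return byte_string.decode('ascii') + text


try:
    implicit_style_concat(pound_bytes, u'1')
    ascii_failed = False
except UnicodeDecodeError:
    # 0xc2 lies outside the ASCII range, so the implicit-style decode fails.
    ascii_failed = True

# Pure ASCII bytes decode cleanly, which is why u'£' + '1' works.
ascii_ok = implicit_style_concat(b'1', u'\xa3')

# Decoding explicitly with the codec the bytes were actually written in.
result = pound_bytes.decode('utf8') + u'1'
```

Here ascii_failed ends up True, while result is the two-character Unicode string u'\xa31'.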

When writing bytes to the terminal or console, it is your terminal or console that interprets the bytes and makes sense of them. If you write a unicode object to the terminal, the sys.stdout object takes care of encoding, converting the text to bytes your terminal or console will understand.
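
To illustrate the encoding step, here is a small analogy rather than what Python 2 does internally: an io.TextIOWrapper around an in-memory byte buffer stands in for sys.stdout and the terminal. The variable names and the choice of UTF-8 are assumptions made for the example.

```python
import io

# An in-memory byte buffer stands in for the terminal's byte stream.
raw = io.BytesIO()

# A text layer that encodes to UTF-8 on write, playing the role of sys.stdout.
stream = io.TextIOWrapper(raw, encoding='utf-8')

stream.write(u'\xa31')  # the text "£1" as a Unicode string
stream.flush()

# What actually reached the "terminal": the pound sign became two bytes.
written = raw.getvalue()
```

After the write, written holds the bytes C2 A3 31, which is exactly what a UTF-8 terminal expects for £1.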

The same applies to taking input; the sys.stdin stream produces bytes, which Python can decode transparently when you use the u'£' syntax to create a Unicode object. You type the character on your keyboard, it is translated to UTF-8 bytes by the terminal or console, and written to Python to interpret.

That writing '\xc2\xa3' with print works, then, is a happy coincidence. You could take the unicode object, encode it with a different codec, and end up with garbage output:

>>> print u'£1'.encode('latin-1')
?1

My Mac terminal converted the data written for the £ sign to a ?, because the A3 byte (the Latin-1 encoding of the pound sign) is not a valid sequence on its own when interpreted as UTF-8.
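
The mismatch can be reproduced without a terminal. This sketch encodes the text as Latin-1 and then reads the bytes back as UTF-8, which is roughly what a UTF-8 terminal has to do; the exact replacement glyph a terminal displays varies, so the 'replace' error handler here is only a stand-in for that behaviour.

```python
# Encode "£1" as Latin-1: the pound sign becomes the single byte A3.
latin1_bytes = u'\xa31'.encode('latin-1')

try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    # A lone A3 byte is a continuation byte with nothing to continue.
    valid_utf8 = False

# A lenient decoder substitutes U+FFFD for the undecodable byte.
shown = latin1_bytes.decode('utf-8', 'replace')
```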

Python determines the terminal or console codec from the locale.getpreferredencoding() function. You can see which codec your terminal or console reported via the sys.stdout.encoding and sys.stdin.encoding attributes:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
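
A hedged sketch of how a script might inspect these values: the fallback to ASCII below is an assumption modelled on Python 2, where sys.stdout.encoding can be None when output is piped rather than sent to a terminal. The names preferred, effective and payload are illustrative only.

```python
import locale
import sys

# The codec the locale settings suggest.
preferred = locale.getpreferredencoding()

# The codec attached to stdout; may be None (for example when piped in Python 2).
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Assumed fallback for this sketch: use ASCII when no encoding is reported.
effective = stdout_encoding or 'ascii'

# Encode defensively so an ASCII-only stream gets '?' instead of an exception.
payload = u'\xa31'.encode(effective, 'replace')
```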

Last but not least, you should not confuse printing with the representations echoed by the interpreter in interactive mode. The interpreter shows the outcome of expressions using the repr() function, a debugging tool that tries to produce Python literal notation wherever possible, using only ASCII characters. For Unicode values, that means any non-printable or non-ASCII character is shown as an escape sequence. This makes the value suitable for copying and pasting without requiring more than an ASCII-capable medium.

The repr() result of a str uses \n for newlines, for example, and \xhh hex escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode objects, codepoints outside the Latin-1 range are represented with \uhhhh or \Uhhhhhhhh escape sequences, depending on whether or not they are part of the Basic Multilingual Plane:

>>> u'''\
... A multiline string to show newlines
... can contain £ latin characters
... or emoji 💩!
... '''
u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
>>> print _
A multiline string to show newlines
can contain £ latin characters
or emoji 💩!
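
The three escape widths can also be checked programmatically. This sketch uses the standard unicode_escape codec, which applies escaping rules similar to the ones repr() uses for unicode objects in Python 2; it is an illustration of the escape forms, not the code path repr() itself takes.

```python
# One sample from each range discussed above.
samples = {
    'newline': u'\n',            # dedicated escape sequence
    'latin1': u'\xa3',           # pound sign, fits in \xhh
    'bmp': u'\u20ac',            # euro sign, needs \uhhhh
    'astral': u'\U0001f4a9',     # emoji outside the BMP, needs \Uhhhhhhhh
}

# Map each label to its ASCII-only escaped form.
escaped = {
    label: text.encode('unicode_escape')
    for label, text in samples.items()
}
```

Each value in escaped is pure ASCII, so it survives being copied through any ASCII-capable medium.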
Martijn Pieters answered Oct 21 '22 06:10
