I've observed the following:
>>> print '£' + '1'
£1
>>> print '£' + u'1'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print u'£' + u'1'
£1
>>> print u'£' + '1'
£1
Why does '£' + '1' work, but '£' + u'1' doesn't?
I looked at the types:
>>> type('£' + '1')
<type 'str'>
>>> type('£' + u'1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> type(u'£' + u'1')
<type 'unicode'>
This also confuses me. If '£' + '1' is a str and not a unicode, why does it print properly on my terminal? Shouldn't it print something like '\xc2\xa31'?
To add to the mix, I've also observed the following:
>>> u'£' + '1'
u'\xa31'
>>> type('1')
<type 'str'>
>>> type(u'£')
<type 'unicode'>
>>> print u'£' + '1'
£1
Why does u'£' + '1' not print out the £ symbol properly, whereas print u'£' + '1' does? Is it because repr is used in the former, whereas str is used in the latter?
Also, how come concatenation of a unicode and a str works in this case, but not in the '£' + u'1' case?
You are mixing object types.
'£' is a bytestring, containing encoded data. That those bytes happen to represent a pound sign in your terminal or console is neither here nor there; they could just as well have been a pixel in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual content of that bytestring is the two bytes C2 and A3, when expressed in hexadecimal.
u'1', on the other hand, is a Unicode string. It is unambiguously text data. If you want to concatenate other data to it, that data should be Unicode too. If you concatenate a str to it anyway, Python 2 automatically decodes the str bytes to Unicode using the default ASCII codec.
However, the '£' bytestring is not decodable as ASCII. It can be decoded as UTF-8; decode the bytes explicitly, since we know the correct codec here:
print '£'.decode('utf8') + u'1'
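The implicit conversion can be made visible by doing the same decode by hand. This is a minimal sketch, assuming the pound sign was typed into a UTF-8 terminal (so the bytestring holds the bytes C2 A3). The variable names are only illustrative, and byte literals and escapes are used so the snippet should behave the same on Python 2.7 and Python 3:

```python
# The two bytes a UTF-8 terminal produces for a pound sign.
pound_bytes = b'\xc2\xa3'

# What Python 2 attempts behind the scenes for '£' + u'1':
# decode the bytestring with the ASCII codec before concatenating.
try:
    pound_bytes.decode('ascii')
    ascii_decodable = True
except UnicodeDecodeError as exc:
    ascii_decodable = False
    print(exc)

# Decoding explicitly with the codec we know is correct works.
combined = pound_bytes.decode('utf-8') + u'1'

# Pure-ASCII bytes survive the ASCII decode, which is why u'£' + '1' works.
ascii_ok = u'\xa3' + b'1'.decode('ascii')
```

Both `combined` and `ascii_ok` end up as the same two-character Unicode string; only the direction of the mix decides whether the ASCII decode has anything to choke on.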
When writing bytes to the terminal or console, it is your terminal or console that interprets the bytes and makes sense of them. If you write a unicode object to the terminal, the sys.stdout object takes care of encoding, converting the text to bytes your terminal or console will understand.
The same applies to taking input; the sys.stdin stream produces bytes, which Python can decode transparently when you use the u'£' syntax to create a Unicode object. You type the character on your keyboard, the terminal or console translates it to UTF-8 bytes, and those bytes are passed to Python to interpret.
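The two directions described above can be imitated by hand. A rough sketch, assuming a UTF-8 terminal; it only mimics what the streams do with explicit encode and decode calls, it is not how sys.stdout or sys.stdin are implemented:

```python
# Output direction: text is encoded to bytes before it reaches the terminal.
text = u'\xa31'                      # the Unicode string u'£1'
outgoing = text.encode('utf-8')      # three bytes: C2 A3 31

# Input direction: the terminal sends bytes, which are decoded back to text.
incoming = b'\xc2\xa3'               # what a UTF-8 terminal sends for £
decoded = incoming.decode('utf-8')   # one code point: U+00A3

print(repr(outgoing))
print(len(outgoing), len(decoded))
```

The pound sign is one character as text but two bytes once encoded, which is the difference between the unicode and str values in the question.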
That writing '\xc2\xa3' with print works, then, is a happy coincidence. You could take the unicode object, encode it with a different codec, and end up with garbage output:
>>> print u'£1'.encode('latin-1')
?1
My Mac terminal converted the data written for the £ sign to a ?, because the A3 byte (the Latin-1 codepoint for the pound sign) doesn't map to anything when interpreted as UTF-8.
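The terminal's behaviour can be approximated in code by decoding the Latin-1 bytes as UTF-8. A small sketch; the 'replace' error handler stands in for whatever substitution a given terminal applies, which is an assumption about terminals in general rather than something Python controls:

```python
latin1_bytes = u'\xa31'.encode('latin-1')   # the bytes A3 31

# A lone A3 byte is not a valid UTF-8 sequence, so strict decoding fails.
try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With a lenient handler the bad byte becomes U+FFFD, the replacement
# character, much like the ? shown by the terminal.
replaced = latin1_bytes.decode('utf-8', 'replace')
```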
Python determines the terminal or console codec from the locale.getpreferredencoding() function; you can observe which codec your terminal or console reported via the sys.stdout.encoding and sys.stdin.encoding attributes:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
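To check all three values in one go, something like the following should work. The getattr() fallbacks are a precaution of mine, since the streams may have been replaced or redirected; in that case the reported encoding can be None:

```python
import locale
import sys

# The encoding the locale settings ask for.
preferred = locale.getpreferredencoding()

# The encodings Python attached to the standard streams, if any.
stdout_encoding = getattr(sys.stdout, 'encoding', None)
stdin_encoding = getattr(sys.stdin, 'encoding', None)

print(preferred)
print(stdout_encoding)
print(stdin_encoding)
```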
Last but not least, you should not confuse printing with the representations echoed by the interpreter in interactive mode. The interpreter shows the outcome of expressions using the repr() function, a debugging tool that tries to produce Python literal notation wherever possible, using only ASCII characters. For Unicode values, that means any non-printable or non-ASCII character is reflected using escape sequences. This makes the value suitable for copying and pasting without requiring more than an ASCII-capable medium.
The repr() result of a str uses \n for newlines, for example, and \xhh hex escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode objects, codepoints outside the Latin-1 range are represented with \uhhhh and \Uhhhhhhhh escape sequences, depending on whether or not they are part of the Basic Multilingual Plane:
>>> u'''\
... A multiline string to show newlines
... can contain £ latin characters
... or emoji 💩!
... '''
u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
>>> print _
A multiline string to show newlines
can contain £ latin characters
or emoji 💩!
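The three escape forms can also be compared one character at a time. The sketch below uses the unicode_escape codec as a stand-in for repr(), on the assumption that it applies the same escaping rules to these characters; the euro sign is an extra example of mine and does not appear in the session above:

```python
samples = [
    ('newline', u'\n'),
    ('pound, inside Latin-1', u'\xa3'),
    ('euro, in the BMP but beyond Latin-1', u'\u20ac'),
    ('emoji, outside the BMP', u'\U0001f4a9'),
]

# Map each label to the ASCII-only escaped form of its character.
escaped = {}
for label, char in samples:
    escaped[label] = char.encode('unicode_escape')
    print('%s: %r' % (label, escaped[label]))
```

The longer the escape, the further the codepoint lies from ASCII: \xhh covers up to U+00FF, \uhhhh covers the rest of the Basic Multilingual Plane, and \Uhhhhhhhh covers everything beyond it.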