# -*- coding: utf-8 -*-
a = 'éáűőúöüó€'
print type(a) # <type 'str'>
print a # éáűőúöüó€
print ord(a[-1]) # 172
Why does this work? Shouldn't this be a SyntaxError: Non-ASCII character '\xc3' in file ...? There are unicode literals in the string.
When I prefix it with u, the results are different:
# -*- coding: utf-8 -*-
a = u'éáűőúöüó€'
print type(a) # <type 'unicode'>
print a # éáűőúöüó€
print ord(a[-1]) # 8364
Why? What is the difference between the internal representations in Python? How can I see it myself? :)
"There are unicode literals in the string"
No, there are not. There are bytes in the string. Python simply goes with the bytes your editor saved to disk when you created the file.
When you prefixed the string with u'', you signalled to Python that you are creating a unicode object instead. Python now pays attention to the encoding you specified at the top of your source file, and it decodes the bytes in the source file to a unicode object based on that encoding.
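To see why the declared encoding matters, here is a minimal sketch that mimics the decoding step by hand. It is an illustration, not what the interpreter literally runs: the names text, raw_bytes, decoded_utf8 and decoded_latin1 are made up for this example, and the string is written with escapes so the result does not depend on how this snippet itself is saved.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The same nine characters as in the question (éáűőúöüó€), written as escapes.
text = u'\xe9\xe1\u0171\u0151\xfa\xf6\xfc\xf3\u20ac'

# Roughly what a UTF-8 editor writes to disk for those characters.
raw_bytes = text.encode('utf-8')

print(len(text))       # 9 code points
print(len(raw_bytes))  # 19 bytes: eight 2-byte characters plus 3 bytes for the euro sign

# Decoding with the right encoding restores the original text.
decoded_utf8 = raw_bytes.decode('utf-8')
print(decoded_utf8 == text)  # True

# Decoding the same bytes with a different encoding yields one
# character per byte, so the result is mojibake rather than the original.
decoded_latin1 = raw_bytes.decode('latin-1')
print(decoded_latin1 == text)  # False
print(len(decoded_latin1))     # 19
```

The byte string and the unicode object hold the same text, but only the unicode object knows where one character ends and the next begins.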
In both cases, your editor saved a series of bytes to a file. For the € character, the UTF-8 encoding is three bytes, represented in hexadecimal as E2 82 AC. The last byte in the bytestring is thus AC, or 172 in decimal. Once you decode the last 3 bytes as UTF-8, they together become the Unicode codepoint U+20AC, which is 8364 in decimal.
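You can check those numbers yourself with a few lines. This is a small sketch meant to run on both Python 2 and Python 3; the variable names (euro, encoded, hex_bytes, last_byte, code_point) are only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import binascii

euro = u'\u20ac'               # the euro sign as a unicode object
encoded = euro.encode('utf-8')  # the three bytes an editor would save

# Hex dump of the encoded bytes.
hex_bytes = binascii.hexlify(encoded).decode('ascii')
print(hex_bytes)  # e282ac

# bytearray gives integer byte values on both Python 2 and 3.
last_byte = bytearray(encoded)[-1]
print(last_byte)  # 172, i.e. 0xAC

# Decoding the three bytes together yields a single code point.
code_point = ord(encoded.decode('utf-8'))
print(code_point)  # 8364, i.e. 0x20AC
```

In Python 2, repr(a) on your own strings shows the same thing directly: the byte string prints as \x escapes for each byte, while the unicode object prints \u escapes for each code point.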
You really should read up on Python and Unicode:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky