PEP 263 specifies that the encoding declared in a source file is applied in the following order:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
So, if I take this code:
print 'abcdefgh'
print u'abcdefgh'
And convert it to ROT-13:
# coding: rot13
cevag 'nopqrstu'
cevag h'nopqrstu'
I would expect the file to be decoded first, making it identical to the original and printing:
abcdefgh
abcdefgh
But instead, it prints:
nopqrstu
abcdefgh
So the unicode literal works as expected, but the str literal remains unconverted. Why?
Eliminating some possibilities:
I confirmed that the problem does not arise in a later phase (such as printing to the console) but already at parsing, because this code raises "ValueError: unsupported format character 'q' (0x71) at index 1":
x = '%q' % 1 # that is %d !
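The check above can be reproduced without a ROT-13 source file. The following is only a small sketch using the standard `codecs` module (the variable name `error_message` is made up for illustration): it confirms that ROT-13 turns `%d` into `%q`, and that a str literal which still contains `%q` at run time fails in the formatting operator.

```python
import codecs

# ROT-13 maps 'd' to 'q', so a '%d' in the original source
# shows up as '%q' in the ROT-13 encoded file.
assert codecs.encode("%d", "rot_13") == "%q"

# If the str literal reaches the % operator still spelled '%q',
# formatting raises ValueError instead of producing '1'.
try:
    "%q" % 1
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because the error comes from the formatting operator itself, the literal must already have been unconverted when the code was compiled, independent of any console output encoding.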
I guess the last point of the PEP 263 list actually explains what happens quite accurately:
- compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
After the first 4 steps, the contents of the source file are a tokenized unicode version of the following string:
print 'abcdefgh'
print u'abcdefgh'
After that, in step 5, the str literal 'abcdefgh' is re-encoded into 8-bit string data using the given file encoding (which is rot13), while the unicode literal is kept as decoded, so the program effectively becomes:
print 'nopqrstu'
print u'abcdefgh'
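The round trip described above can be imitated by hand. This is a rough sketch, not the actual parser: `simulate_literals` and `FILE_ENCODING` are hypothetical names, and the UTF-8 intermediate form from steps 3 and 4 is skipped because it does not change the outcome for ASCII text.

```python
import codecs

FILE_ENCODING = "rot_13"  # assumed from the "# coding: rot13" line

def simulate_literals(raw_str_literal, raw_unicode_literal):
    """Hypothetical helper mimicking the PEP 263 steps for both literal kinds."""
    # Steps 1-4: the whole file is decoded to Unicode before tokenizing,
    # so both literal bodies are decoded with the file encoding.
    decoded_str = codecs.decode(raw_str_literal, FILE_ENCODING)
    decoded_unicode = codecs.decode(raw_unicode_literal, FILE_ENCODING)

    # Step 5: str literals are re-encoded with the file encoding,
    # while unicode literals keep the decoded text.
    str_value = codecs.encode(decoded_str, FILE_ENCODING)
    unicode_value = decoded_unicode
    return str_value, unicode_value

# Both literals are spelled 'nopqrstu' in the ROT-13 file.
str_value, unicode_value = simulate_literals("nopqrstu", "nopqrstu")
print(str_value)
print(unicode_value)
```

Under these assumptions the str literal comes back as the bytes that were in the file, and only the unicode literal ends up decoded, which matches the observed output.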