
How does the Python compiler preprocess the source file with the declared encoding?

Let's say I have a Python 3 source file in cp1251 encoding with the following content:

# эюяьъ (some Russian comment)
print('Hehehey')

If I run the file, I'll get this:

SyntaxError: Non-UTF-8 code starting with '\xfd' in file ... on line 1 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

That's clear and expected: I understand that, in general, a cp1251 byte sequence can't be decoded with UTF-8, which is the default source encoding in Python 3.

But if I edit the file as follows:

# coding: utf-8
# эюяьъ (some Russian comment)
print('Hehehey')  

everything will work fine.

And that is pretty confusing.
In the 2nd example the source still contains the same cp1251 byte sequence, which is not valid UTF-8, so I would expect the compiler to use the same encoding (UTF-8) to preprocess the file and terminate with the same error.
I have read PEP 263 but still don't get the reason it doesn't happen.

So why does my code work in the 2nd case but fail in the 1st?


UPD.

In order to check whether my text editor is smart enough to change the file's encoding because of the line # coding: utf-8, let's look at the actual bytes:

(1st example)

23 20 fd fe ff fa fc ...

(2nd example)

23 20 63 6f 64 69 6e 67 3a 20 75 74 66 2d 38 0a
23 20 fd fe ff fa fc ...

These 0xf? bytes encode Cyrillic letters in cp1251, and they are not valid UTF-8.
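
For what it's worth, the same check can be run from Python directly (a small sketch using just the bytes shown in the dump above):

data = bytes.fromhex('23 20 fd fe ff fa fc')
print(data.decode('cp1251'))        # '# эюяъь' - the bytes are valid cp1251
try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)                      # 'utf-8' codec can't decode byte 0xfd in position 2 ...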

Furthermore, if I edit the source this way:

# coding: utf-8
# эюяъь (some Russian comment)
print('Hehehey')
print('эюяъь')

I'll face the error:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xfd ...

So, unfortunately, my text editor isn't that smart.
Thus, in the above examples the source file is not converted from cp1251 to UTF-8.
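
To take the text editor out of the picture entirely, both cases can be reproduced by writing the raw bytes to files and running them. This is only a minimal sketch (the file names are arbitrary, and whether the second case runs cleanly may depend on the CPython version; it matches what I observe here):

import subprocess
import sys

comment = '# эюяьъ (some Russian comment)\n'.encode('cp1251')
body = b"print('Hehehey')\n"

# Case 1: no encoding declaration - fails with the PEP 263 SyntaxError
with open('no_decl.py', 'wb') as f:
    f.write(comment + body)

# Case 2: declaration present, comment is still raw cp1251 bytes - runs fine
with open('with_decl.py', 'wb') as f:
    f.write(b'# coding: utf-8\n' + comment + body)

for name in ('no_decl.py', 'with_decl.py'):
    proc = subprocess.run([sys.executable, name], capture_output=True, text=True)
    print(name, '->', proc.stdout.strip() or proc.stderr.strip().splitlines()[-1])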

Asked Oct 16 '17 by MaximTitarenko

1 Answer

This seems to be a quirk of how the strict behavior for the default encoding is enforced. In the tokenizer function decoding_gets, if no explicit encoding declaration has been found yet (tok->encoding is still NULL), it does a character-by-character check of the line for invalid UTF-8 and raises the SyntaxError you're seeing, the one that references PEP 263.

But if an encoding has been specified, check_coding_spec will have set tok->encoding, and that strict default-encoding test is bypassed completely; it isn't replaced with a test against the declared encoding.
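
In rough Python terms, the two branches amount to something like this (an illustrative paraphrase only, not the actual CPython C code; the function name and message text are made up for the sketch):

def check_source_line(line_bytes, declared_encoding):
    """Illustrative paraphrase of the tokenizer's per-line handling."""
    if declared_encoding is None:
        # Default path: the line must be valid UTF-8, checked eagerly, so
        # garbage bytes anywhere on the line (even inside a comment) raise
        # the "Non-UTF-8 code ... but no encoding declared" SyntaxError.
        try:
            line_bytes.decode('utf-8')
        except UnicodeDecodeError as exc:
            bad = line_bytes[exc.start:exc.start + 1]
            raise SyntaxError(
                "Non-UTF-8 code starting with %r but no encoding declared; "
                "see PEP 263" % bad) from None
    else:
        # A coding declaration was found: the eager file-wide check is simply
        # skipped. Bytes are only decoded where the tokenizer actually needs
        # them (e.g. string literals), and comment bytes are read and thrown
        # away without ever being validated.
        pass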

Normally this would cause problems when the code is actually being parsed, but it looks like comments are handled in a stripped-down way: as soon as the comment character, #, is recognized, the tokenizer just grabs and discards characters until it sees a newline or EOF; it doesn't try to do anything with them at all (which makes sense: parsing comments would just waste time that could be spent on stuff that actually runs).

Thus, the behavior you observe: an encoding declaration disables the strict, file-wide, character-by-character check for valid UTF-8 that is applied when no encoding is declared explicitly, and comments are special-cased so that their contents are ignored, allowing garbage bytes in comments to escape detection.

Answered Sep 20 '22 by ShadowRanger