
Byte string literal with non-ascii characters

Apparently, I can do this in Python 2.7:

value = '國華'

It seems Python is using an encoding to encode the characters in the string literal to a byte string. What is that encoding? Is that the encoding defined in sys.getdefaultencoding(), the encoding of the source file, or something else?

Thanks

Flavien asked Aug 16 '12 18:08



1 Answer

sys.getdefaultencoding() has no relation to the encoding of the source file or the terminal. It is the encoding used to implicitly convert byte strings to Unicode strings, and should always be 'ascii' on Python 2.X ('utf-8' on Python 3.X).
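As a quick check, the default encoding can be inspected directly. This is a minimal sketch that assumes it is run on Python 3; on Python 2.X the reported value would be 'ascii' instead.

```python
import sys

# Report the interpreter-wide default encoding. It is fixed by the
# interpreter and does not follow the source file's coding declaration
# or the terminal's encoding.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```

Changing the coding line at the top of a script, or running it in a different console, should not change this value.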

On Python 2.X, your line of code in a script with no encoding declared produces an error:

SyntaxError: Non-ASCII character '\x87' in file ...

The actual non-ASCII character may vary, but it won't work without an encoding declaration. An encoding declaration is required to use non-ASCII characters on Python 2.X. The encoding declaration must match the source file encoding. For example:

# coding: utf8
value = '國華'

when saved as cp936 produces:

SyntaxError: 'utf8' codec can't decode byte 0x87 in position 9: invalid start byte
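The mismatch can be reproduced without saving any files. The sketch below builds the bytes an editor would write when saving the assignment line as cp936, then decodes them as UTF-8, which is roughly what the parser attempts when the file declares utf8. The variable names are illustrative only, and the characters are written as \u escapes so the snippet does not depend on its own file encoding.

```python
text = u'\u570b\u83ef'  # the two characters from the example

# Bytes of the assignment line as saved by an editor using cp936.
source_bytes = (u"value = '" + text + u"'").encode('cp936')

error = None
try:
    # The declared encoding (utf8) does not match the saved bytes.
    source_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The failure is reported at offset 9 because the nine ASCII characters of "value = '" decode fine in either encoding; the first cp936 byte, 0x87, is not a valid UTF-8 start byte.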

When the encoding is correct, the bytes in the byte string are literally what is in the source file, so it will contain the encoded bytes of the characters. When Python parses a Unicode string, the bytes are decoded to Unicode using the declared source encoding. Note the difference when printing a UTF-8 byte string and a Unicode string on a cp936 console:

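# coding: utf8
# Byte string: holds the raw UTF-8 bytes from the source file.
value = '國華'
print value, repr(value)
# Unicode string: the source bytes are decoded to code points.
value = u'國華'
print value, repr(value)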

Output:

鍦嬭彲 '\xe5\x9c\x8b\xe8\x8f\xaf'
國華 u'\u570b\u83ef'

The byte string contains the 3-byte UTF-8 encodings of the two characters, but they are displayed incorrectly because a cp936 terminal interprets the byte sequence as cp936. The Unicode string is printed correctly, and it contains the Unicode code points decoded from the UTF-8 bytes of the source file.
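The garbled output can be reproduced by hand. This sketch encodes the two characters as UTF-8 and then decodes the same bytes as cp936, standing in for what the terminal does; it uses explicit encode/decode calls so it does not need a cp936 console.

```python
text = u'\u570b\u83ef'  # the two characters from the example

# Six bytes: a 3-byte UTF-8 sequence for each character.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))

# A cp936 terminal reads the same six bytes as three 2-byte characters.
misread = utf8_bytes.decode('cp936')
print(repr(misread))
```

Six bytes read two at a time yield three characters, which is why two characters went in and three unrelated ones came out.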

Note the difference when declaring and using the encoding that matches the terminal:

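# coding: cp936
# Byte string: holds the raw cp936 bytes from the source file.
value = '國華'
print value, repr(value)
# Unicode string: the source bytes are decoded to code points.
value = u'國華'
print value, repr(value)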

Output:

國華 '\x87\xf8\xc8A'
國華 u'\u570b\u83ef'

The content of the byte string is now the 2-byte cp936 encodings of the two characters ('A' equivalent to '\x41') and is displayed correctly since the terminal understands the cp936 byte sequence. The Unicode string contains the same Unicode code points for the two characters as the previous example because the source byte sequence was decoded using the declared source encoding to Unicode.
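A short sketch of the same point using explicit conversions: the encoded bytes differ between cp936 and UTF-8, but decoding each with its own encoding gives identical code points. Variable names are illustrative.

```python
text = u'\u570b\u83ef'  # the two characters from the example

cp936_bytes = text.encode('cp936')  # 2 bytes per character
utf8_bytes = text.encode('utf-8')   # 3 bytes per character
print(repr(cp936_bytes), repr(utf8_bytes))

# Decoding with the matching encoding recovers the same code points.
from_cp936 = cp936_bytes.decode('cp936')
from_utf8 = utf8_bytes.decode('utf-8')
print(from_cp936 == from_utf8)
```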

If a script has a correct source encoding declaration and uses Unicode strings for text, it will display the correct characters[1] regardless of terminal encoding[2]. It will throw a UnicodeEncodeError if the terminal doesn't support the character rather than display the wrong character.
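The failure mode can be sketched by encoding the text the way printing to a terminal would. The helper encode_for_terminal is hypothetical and exists only for this illustration; cp1252 stands in for a terminal encoding that has no mapping for these characters.

```python
text = u'\u570b\u83ef'  # the two characters from the example

def encode_for_terminal(s, terminal_encoding):
    """Hypothetical helper: encode text as printing to a terminal would."""
    return s.encode(terminal_encoding)

# A cp936 terminal supports both characters.
ok = encode_for_terminal(text, 'cp936')

failed = False
try:
    # cp1252 (Western European) cannot represent these characters.
    encode_for_terminal(text, 'cp1252')
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```

The error is raised before anything is written, so no wrong characters reach the screen.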

A final note: Python 2.X defaults to 'ascii' as the source encoding unless declared otherwise, and allows non-ASCII characters in byte string literals if the declared encoding supports them. Python 3.X defaults to 'utf8' as the source encoding (so make sure to save in that encoding or declare otherwise), and does not allow non-ASCII characters in byte string literals.
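The Python 3.X restriction on byte string literals can be checked with compile(). This sketch assumes Python 3; on Python 2.X the first compile() call would be expected to succeed instead.

```python
# Source text containing a bytes literal with non-ASCII characters.
bad_source = u"value = b'\u570b\u83ef'"
# The same UTF-8 bytes written with ASCII-only escape sequences.
good_source = u"value = b'\\xe5\\x9c\\x8b\\xe8\\x8f\\xaf'"

rejected = False
try:
    compile(bad_source, '<demo>', 'exec')
except SyntaxError as exc:
    rejected = True
    print(exc.msg)

# Escapes are fine because the literal itself stays ASCII.
namespace = {}
exec(compile(good_source, '<demo>', 'exec'), namespace)
print(repr(namespace['value']))
```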

[1] If the terminal font supports the character.
[2] If the terminal encoding supports the character.

Mark Tolonen answered Oct 02 '22 04:10
