Consider the next example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using cp1251 encoding within IDLE, but it seems like the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why so? Is there a spec for this behavior?
(CPython 2.7.)
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
Seems like when encoding unicode with the latin1 codec, all code points less than 256 are simply left as-is, thus resulting in the bytes I typed in before.
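This property is easy to check. A minimal sketch (Python 3 syntax here, since str/bytes are explicit there; the question itself is Python 2, but latin-1 behaves the same way): latin-1 maps every code point below 256 directly to the byte of the same numeric value.

```python
# latin-1 is an identity mapping for code points 0..255:
# each code point encodes to the single byte with the same value.
for cp in range(256):
    assert chr(cp).encode('latin1') == bytes([cp])

# So u'\xe1\xe0\xe1\xe0' encodes to exactly those four byte values.
assert '\xe1\xe0\xe1\xe0'.encode('latin1') == b'\xe1\xe0\xe1\xe0'
```

This is why encoding with latin1 "recovers" the original keystrokes: it hands back the very bytes the terminal sent.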
When you type a character such as б into the terminal, you see a б, but what is really inputted is a sequence of bytes. Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)
Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode happens to represent áàáà.
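The coincidence at the heart of the mix-up can be shown directly. A sketch in Python 3 syntax (an assumption for readability; the mechanics are identical in Python 2): the bytes cp1251 uses for баба are the same bytes latin-1 uses for áàáà.

```python
# cp1251 encodes Cyrillic б as 0xE1 and а as 0xE0 ...
raw = 'баба'.encode('cp1251')
assert raw == b'\xe1\xe0\xe1\xe0'

# ... while latin-1 reads those same byte values as á (0xE1) and à (0xE0).
assert raw.decode('latin1') == 'áàáà'
```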
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'.
The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0' and decodes them with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
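The whole round trip the answer describes can be condensed into one line. A Python 3 sketch (assumed syntax; Python 2 would write the first string with a u prefix): encode the mistakenly-latin-1 string back to bytes, then decode those bytes the way the terminal does.

```python
s = '\xe1\xe0\xe1\xe0'  # what Python actually stored for u"баба"

# latin-1 turns it back into the original keystrokes' bytes,
# and cp1251 (the terminal's encoding) reads them as Cyrillic.
assert s.encode('latin1').decode('cp1251') == 'баба'
```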
Try:
>>> s = "баба"
(without the u) instead. Or,
>>> s = "баба".decode('cp1251')
to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the short but less readily comprehensible:
>>> s = u'\u0431\u0430\u0431\u0430'
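Both escape spellings denote the same code points, so either form is safe regardless of terminal encoding. A quick check (Python 3 syntax, where all string literals are unicode):

```python
verbose = ('\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')
short = '\u0431\u0430\u0431\u0430'

# \N{...} looks a name up in the Unicode database; \uXXXX gives the
# code point directly. Both yield б (U+0431) and а (U+0430).
assert verbose == short == 'баба'
```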