
Encoding used for u"" literals

Consider the next example:

>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà

I'm using the cp1251 encoding in IDLE, but it seems the interpreter actually uses latin1 to create the unicode string:

>>> print s.encode('latin1')
баба

Why is that? Is there a spec for this behavior?

CPython, 2.7.


Edit

The code I was actually looking for is

>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True

It seems that when a unicode string is encoded with the latin1 codec, every code point below 256 is simply written out as the byte with the same value, which yields the bytes I originally typed.
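A quick way to check that observation, as a minimal sketch rather than anything taken from a spec: build all 256 byte values, decode them as latin1, and compare code points with byte values. The variable names are made up for illustration, and the snippet is written to run on both Python 2.7 and 3.x.

```python
# Sketch: latin1 maps byte value N to code point N (and back) for N in 0..255.
all_bytes = bytes(bytearray(range(256)))      # every possible byte value
as_text = all_bytes.decode('latin-1')         # one code point per byte

# Code points equal the original byte values.
code_points = [ord(ch) for ch in as_text]
identity_on_decode = (code_points == list(range(256)))

# Encoding back gives exactly the bytes we started with.
identity_on_encode = (as_text.encode('latin-1') == all_bytes)

print(identity_on_decode, identity_on_encode)
```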

Roman Bodnarchuk asked Jan 15 '12 19:01



1 Answer

When you type a character such as б into the terminal, you see a б, but what is actually sent to Python is a sequence of bytes.

Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:

In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'

(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)

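Since typing баба produces the bytes \xe1\xe0\xe1\xe0, typing u"баба" hands Python those four bytes inside a unicode literal. In your session each byte is taken as the code point with the same value (which is what decoding as latin1 does), so Python ends up with u'\xe1\xe0\xe1\xe0' rather than u'\u0431\u0430\u0431\u0430'. This is why you are seeing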

>>> s
u'\xe1\xe0\xe1\xe0'

This unicode string happens to represent áàáà.
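The mismatch can be reproduced without depending on any particular terminal. The following is only a sketch under one assumption, namely that the interpreter treats the cp1251 bytes as if they were latin1; the variable names are invented for illustration. It uses escapes only, so it runs on Python 2.7 and 3.x alike.

```python
# Sketch of the mix-up, assuming the cp1251 bytes get decoded as latin1.
intended = u'\u0431\u0430\u0431\u0430'        # the Cyrillic text that was meant

typed_bytes = intended.encode('cp1251')       # what a cp1251 terminal sends
received = typed_bytes.decode('latin-1')      # what the interpreter builds

print(typed_bytes == b'\xe1\xe0\xe1\xe0')     # the bytes from the question
print(received == u'\xe1\xe0\xe1\xe0')        # the "wrong" unicode string
print(received == intended)                   # not the text that was typed
```

The first two comparisons hold and the last one does not, which matches the output shown in the question.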

And when you type

>>> print s.encode('latin1')

the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'. The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0', and decodes them with cp1251, thus printing баба:

In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
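The same round trip can be written out in one place. This is a hedged sketch with illustrative names: it takes the mis-decoded string, encodes it with latin1 to get the original bytes back, and then decodes those bytes the way a cp1251 terminal would.

```python
# Sketch: latin1 encoding undoes the mis-decoding, cp1251 decoding finishes the job.
mangled = u'\xe1\xe0\xe1\xe0'                 # what the interpreter holds

raw = mangled.encode('latin-1')               # bytes written to the terminal
shown = raw.decode('cp1251')                  # how a cp1251 terminal reads them

# Each byte has the same value as the code point it came from.
same_values = (list(bytearray(raw)) == [ord(ch) for ch in mangled])

print(same_values)
print(shown == u'\u0431\u0430\u0431\u0430')
```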

Try:

>>> s = "баба"

(without the u) instead, which keeps s as a byte string in the terminal's encoding. Or,

>>> s = "баба".decode('cp1251')

to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):

>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'

Or the shorter but less readable

>>> s = u'\u0431\u0430\u0431\u0430'
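As a sanity check that these spellings agree, here is a small sketch comparing the named-escape form, the \u form, and the result of decoding the cp1251 bytes explicitly. The variable names are illustrative, and `unicodedata.name` from the standard library is used only to confirm which characters the escapes denote.

```python
import unicodedata

# Three terminal-independent ways to build the same unicode string.
by_name = (u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')
by_escape = u'\u0431\u0430\u0431\u0430'
from_bytes = b'\xe1\xe0\xe1\xe0'.decode('cp1251')

print(by_name == by_escape == from_bytes)

# Character names for the first two code points.
names = [unicodedata.name(ch) for ch in by_escape[:2]]
print(names)
```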

unutbu answered Oct 22 '22 19:10