Consider the next example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using cp1251 encoding within IDLE, but it seems like the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why so? Is there a spec for this behavior?
(CPython 2.7.)
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
Seems like when encoding unicode with the latin1 codec, all code points less than 256 are simply left as-is, thus resulting in the bytes I typed in before.
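This property is easy to check. A minimal sketch (Python 3 syntax here, since str/bytes are explicit there; the question itself is Python 2, but latin-1 behaves the same way): latin-1 maps every code point below 256 directly to the byte of the same numeric value.

```python
# latin-1 is an identity mapping for code points 0..255:
# each code point encodes to the single byte with the same value.
for cp in range(256):
    assert chr(cp).encode('latin1') == bytes([cp])

# So u'\xe1\xe0\xe1\xe0' encodes to exactly those four byte values.
assert '\xe1\xe0\xe1\xe0'.encode('latin1') == b'\xe1\xe0\xe1\xe0'
```

This is why encoding with latin1 "recovers" the original keystrokes: it hands back the very bytes the terminal sent.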
When you type a character such as б into the terminal, you see a б, but what is really inputted is a sequence of bytes. Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)
Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode happens to represent áàáà.
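The coincidence at the heart of the mix-up can be shown directly. A sketch in Python 3 syntax (an assumption for readability; the mechanics are identical in Python 2): the bytes cp1251 uses for баба are the same bytes latin-1 uses for áàáà.

```python
# cp1251 encodes Cyrillic б as 0xE1 and а as 0xE0 ...
raw = 'баба'.encode('cp1251')
assert raw == b'\xe1\xe0\xe1\xe0'

# ... while latin-1 reads those same byte values as á (0xE1) and à (0xE0).
assert raw.decode('latin1') == 'áàáà'
```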
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'.
The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0' and decodes them with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
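The whole round trip the answer describes can be condensed into one line. A Python 3 sketch (assumed syntax; Python 2 would write the first string with a u prefix): encode the mistakenly-latin-1 string back to bytes, then decode those bytes the way the terminal does.

```python
s = '\xe1\xe0\xe1\xe0'  # what Python actually stored for u"баба"

# latin-1 turns it back into the original keystrokes' bytes,
# and cp1251 (the terminal's encoding) reads them as Cyrillic.
assert s.encode('latin1').decode('cp1251') == 'баба'
```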
Try:
>>> s = "баба"
(without the u) instead. Or,
>>> s = "баба".decode('cp1251')
to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the short but less readily comprehensible:
>>> s = u'\u0431\u0430\u0431\u0430'
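Both escape spellings denote the same code points, so either form is safe regardless of terminal encoding. A quick check (Python 3 syntax, where all string literals are unicode):

```python
verbose = ('\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
           '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}')
short = '\u0431\u0430\u0431\u0430'

# \N{...} looks a name up in the Unicode database; \uXXXX gives the
# code point directly. Both yield б (U+0431) and а (U+0430).
assert verbose == short == 'баба'
```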