
Python Unicode strings and the Python interactive interpreter

I'm trying to understand how Python 2.5 deals with unicode strings. Although by now I think I have a good grasp of how I'm supposed to handle them in code, I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.

So Python before 3.0 has two string types: str (byte strings) and unicode, both derived from basestring. The default type for string literals is str.

str objects have no notion of their actual encoding; they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding it is in, or you've read a stream of bytes whose encoding you (ideally) know beforehand. You can guess the encoding of a byte string whose encoding is unknown to you, but there just isn't a reliable way of figuring it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
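
To make "decode early, encode late" concrete, here is a minimal sketch. The variable names are made up for illustration, and the b'' prefix assumes Python 2.6 or later (on 2.5 you would drop it; on Python 3 it is required):

```python
# Bytes arrive from outside (file, socket, console); we are assuming they are UTF-8.
incoming = b'La ca\xc3\xb1a de Espa\xc3\xb1a'

# Decode early: turn the bytes into unicode text right at the boundary.
text = incoming.decode('utf-8')

# Work on unicode everywhere in between; here the n-tilde is one character, not two bytes.
shouted = text.upper()

# Encode late: go back to bytes only when the data leaves the program.
outgoing = shouted.encode('utf-8')
```

Note that len(incoming) counts bytes while len(text) counts characters, which is why string operations are only safe on the decoded form.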

That's fine. But are strings typed into the interpreter encoded for you behind your back? Provided that my understanding of strings in Python is correct, what method or setting does Python use to make this decision?

The source of my confusion is the differing results I get when I try the same thing in my system's Python installation and in my editor's embedded Python console.

 # Editor (Sublime Text)
 >>> s = "La caña de España"
 >>> s
 'La ca\xc3\xb1a de Espa\xc3\xb1a'
 >>> s.decode("utf-8")
 u'La ca\xf1a de Espa\xf1a'
 >>> sys.getdefaultencoding()
 'ascii'

 # Windows python interpreter
 >>> s= "La caña de España"
 >>> s
 'La ca\xa4a de Espa\xa4a'
 >>> s.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
 >>> sys.getdefaultencoding()
 'ascii'
guillermooo asked Mar 10 '10 22:03



2 Answers

Let me expand on Ignacio's reply: in both cases there is an extra layer between Python and you. In one case it is Sublime Text and in the other it's cmd.exe. The difference in behaviour you see is due not to Python but to the different encodings used by Sublime Text (UTF-8, it seems) and cmd.exe (cp437).

So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends '\xa4'. (I'm simplifying here, omitting details that are not relevant to the question.)
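
You can reproduce those two byte sequences yourself by encoding the same character both ways. This is only a sketch: try_decode is a hypothetical helper, and the b''/u'' spellings assume Python 2.6+ or 3.3+:

```python
n_tilde = u'\xf1'  # the single code point U+00F1

as_utf8 = n_tilde.encode('utf-8')    # two bytes: 0xC3 0xB1
as_cp437 = n_tilde.encode('cp437')   # one byte: 0xA4


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The cp437 byte is not valid UTF-8, which is the error seen in the Windows console.
failed = try_decode(as_cp437, 'utf-8')

# The reverse direction does not fail, it just silently yields the wrong characters.
garbled = try_decode(as_utf8, 'cp437')
```

The second case is the more dangerous one: cp437 assigns a character to every byte value, so decoding with the wrong codec raises no error at all.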

Still, Python knows about that. From cmd.exe you'll probably get something like:

>>> import sys
>>> sys.stdin.encoding
'cp437'

whereas within Sublime Text you'll get something like:

>>> import sys
>>> sys.stdin.encoding
'utf-8'
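
If you want your own code to use that information, something along these lines might work. It is a sketch, not a recipe: console_encoding and decode_typed are made-up names, and the 'ascii' fallback is my assumption for the case where stdin has no encoding attribute (for example when it is redirected):

```python
import sys


def console_encoding(fallback='ascii'):
    """Best guess at the encoding of whatever is feeding stdin."""
    # sys.stdin may be missing, or its encoding may be None.
    return getattr(sys.stdin, 'encoding', None) or fallback


def decode_typed(raw, encoding=None):
    """Decode bytes typed at the prompt into unicode text."""
    return raw.decode(encoding or console_encoding())


# Passing the encoding explicitly mimics what each environment would report.
from_cmd = decode_typed(b'Espa\xa4a', 'cp437')
from_sublime = decode_typed(b'Espa\xc3\xb1a', 'utf-8')
```

Both calls give the same unicode string, because each byte string is decoded with the encoding of the environment it came from.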
krawyoti answered Sep 21 '22 22:09


The interpreter uses your command prompt's native encoding for text entry. In your case it's CP437:

>>> print '\xa4'.decode('cp437')
ñ
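
The same idea applied to both strings from the question, as a short sketch (variable names are mine; the b'' prefix assumes Python 2.6+ or 3):

```python
# The raw bytes each environment produced for the same typed text.
typed_in_cmd = b'La ca\xa4a de Espa\xa4a'              # cp437
typed_in_sublime = b'La ca\xc3\xb1a de Espa\xc3\xb1a'  # UTF-8

# Decode each with its own encoding to get unicode text.
text_from_cmd = typed_in_cmd.decode('cp437')
text_from_sublime = typed_in_sublime.decode('utf-8')

# Once decoded, the two are the same string.
same_text = text_from_cmd == text_from_sublime

# Re-encoding the cmd.exe text as UTF-8 gives exactly the bytes Sublime Text sent.
reencoded = text_from_cmd.encode('utf-8')
```

So the two str values only look different because they are two encodings of one and the same unicode string.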
Ignacio Vazquez-Abrams answered Sep 24 '22 22:09