Python Unicode strings and the Python interactive interpreter

Tags: python, string, unicode, sublimetext

I'm trying to understand how Python 2.5 deals with unicode strings. Although I think I now have a good grasp of how I'm supposed to handle them in code, I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.

Python before 3.0 has two string types: str (byte strings) and unicode, both derived from basestring. The default type for string literals is str.

str objects have no notion of their actual encoding; they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding the bytes are in, or you've read a stream of bytes whose encoding you (ideally) also know beforehand. You can guess the encoding of a byte string when it is unknown to you, but there is no reliable way of figuring it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
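
To make the "decode early, encode late" idea concrete, here is a minimal sketch. The helper name `shout` and its parameters are made up for illustration. It uses `b""` literals so that it runs on Python 2.6+ and Python 3 alike; on 2.5 you would drop the `b` prefix.

```python
def shout(raw_bytes, source_encoding, target_encoding="utf-8"):
    """Hypothetical helper: upper-case text that arrives and leaves as bytes."""
    # Decode early: turn the incoming bytes into text as soon as they arrive.
    text = raw_bytes.decode(source_encoding)
    # Work on text only: upper() handles non-ASCII letters on decoded text.
    text = text.strip().upper()
    # Encode late: go back to bytes only at the output boundary.
    return text.encode(target_encoding)


# b" ca\xa4a " is " caña " encoded as cp437.
result = shout(b" ca\xa4a ", "cp437")
```

The point is that the encoding is named exactly twice, once on the way in and once on the way out, and everything in between works on text.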

That's fine. But are strings typed into the interpreter encoded for you behind your back? Assuming my understanding of strings in Python is correct, what method or setting does Python use to decide on the encoding?

The source of my confusion is the differing results I get when I try the same thing on my system's Python installation and in my editor's embedded Python console.

 # Editor (Sublime Text)
 >>> s = "La caña de España"
 >>> s
 'La ca\xc3\xb1a de Espa\xc3\xb1a'
 >>> s.decode("utf-8")
 u'La ca\xf1a de Espa\xf1a'
 >>> sys.getdefaultencoding()
 'ascii'

 # Windows python interpreter
 >>> s= "La caña de España"
 >>> s
 'La ca\xa4a de Espa\xa4a'
 >>> s.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
 >>> sys.getdefaultencoding()
 'ascii'

asked Mar 10 '10 22:03

guillermooo

2 Answers

Let me expand on Ignacio's reply. In both cases there is an extra layer between you and Python: in one case it is Sublime Text, and in the other it is cmd.exe. The difference in behaviour you see is not due to Python but to the different encodings used by Sublime Text (utf-8, it seems) and cmd.exe (cp437).

So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends '\xa4'. (I'm simplifying here, omitting details that are not relevant to the question.)
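
You can reproduce both byte sequences from the question without either console. This is only a sketch: the variable names are mine, and the literals use escapes so that the source file's own encoding does not get in the way.

```python
# The text from the question, written with escapes (\xf1 is ñ).
text = u"La ca\xf1a de Espa\xf1a"

as_utf8 = text.encode("utf-8")    # the bytes a UTF-8 front end hands to Python
as_cp437 = text.encode("cp437")   # the bytes a cp437 front end hands to Python

# Decoding with the matching codec round-trips in both cases.
assert as_utf8.decode("utf-8") == text
assert as_cp437.decode("cp437") == text

# Decoding the cp437 bytes as UTF-8 fails, as in the question's traceback.
try:
    as_cp437.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

The same text yields two different byte strings, and each one only makes sense together with the name of the encoding that produced it.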

Still, Python knows which encoding the layer in front of it uses. From cmd.exe you'll probably get something like:

>>> import sys
>>> sys.stdin.encoding
'cp437'

whereas within Sublime Text you'll get something like:

>>> import sys
>>> sys.stdin.encoding
'utf-8'
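
Building on that, a byte string typed at the prompt can be decoded with whatever encoding the console reports, instead of hard-coding one. A rough sketch follows; `decode_typed` is a hypothetical helper, and falling back to 'ascii' when no encoding is reported is my own assumption, not something Python does for you.

```python
import sys


def decode_typed(raw_bytes, encoding=None):
    """Hypothetical helper: decode console input with the console's encoding."""
    if encoding is None:
        # sys.stdin.encoding can be None or missing (e.g. when input is piped),
        # so fall back to 'ascii' in that case (an assumption of this sketch).
        encoding = getattr(sys.stdin, "encoding", None) or "ascii"
    return raw_bytes.decode(encoding)


# Passing the encoding explicitly mimics the two consoles from the question.
from_cmd = decode_typed(b"\xa4", "cp437")
from_sublime = decode_typed(b"\xc3\xb1", "utf-8")
```

Both calls give back the same one-character unicode string, which is what you want to carry around in the rest of your code.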

answered Sep 21 '22 22:09

krawyoti

The interpreter uses your command prompt's native encoding for text entry. In your case it's CP437:

>>> print '\xa4'.decode('cp437')
ñ

answered Sep 24 '22 22:09

Ignacio Vazquez-Abrams
